Multi-row merge functions collect_set, collect_list, concat_ws in sql

Take N fields from the database, and then combine them together and use "," to separate the display. At first, I thought of using CONCAT() to process it, but it was troublesome. Unexpectedly, CONCAT_WS() was mentioned in the manual, which is very easy to use.

CONCAT_WS(separator, str1, str2,...)

It is a special form of CONCAT(). The first parameter is the separator between the remaining parameters. The separator can be the same string as the remaining parameters. If the separator is NULL, the return value will also be NULL and this function will jump. Any NULL and empty strings after the delimiter parameter, the delimiter will be added between the strings being concatenated.

A simple example is as follows:

mysql> SELECT CONCAT_WS(",","First name","Second name","Last Name");
       -> 'First name,Second name,Last Name'
mysql> SELECT CONCAT_WS(",","First name",NULL,"Last Name");
       -> 'First name,Last Name'

The collect-related functions in Hive include collect_list and collect_set.

They both convert a column in the grouping into an array and return it. The difference is that collect_list does not remove duplicates but collect_set does.

Do a simple experiment to deepen your understanding, and create an experimental table to store the records of users' daily on-demand videos:

create table t_visit_video (
    username string,
    video_name string
) partitioned by (day string)
row format delimited fields terminated by ',';

Create a test data file on the local file system:

张三,大唐双龙传
李四,天下无贼
张三,神探狄仁杰
李四,霸王别姬
李四,霸王别姬
王五,机器人总动员
王五,放牛班的春天
王五,盗梦空间

Load data into Hive table:

load data local inpath '/root/hive/visit.data' into table t_visit_video partition (day='20180516');

Group by user and retrieve the names of all videos watched by each user every day:

select username, collect_list(video_name) from t_visit_video group by username ;

张三  ["大唐双龙传","神探狄仁杰"]
李四  ["天下无贼","霸王别姬","霸王别姬"]
…………

However, there is a problem with the above query results. Because Farewell My Concubine is so good, Li Si watched it twice. This directly leads to duplicates in the list of watched videos, so duplication should be added and collect_set should be used. The difference between it and collect_list is that it will remove duplicates:

select username, collect_set(video_name) from t_visit_video group by username;

张三  ["大唐双龙传","神探狄仁杰"]
李四  ["天下无贼","霸王别姬"]
…………

Farewell My Concubine only appears once in Li Si's viewing records, achieving the effect of de-duplication.

Break through group by restrictions

You can also use collect to break through the restrictions of group by. In Hive, when querying group by, it is required that the columns that appear after select must appear after group by, that is, the select column must be the column used as the basis for grouping, but some Sometimes we want to group according to A and then randomly take out a B in each group. In this experiment, we group by user, and then randomly pick out the name of a video he has watched:

select username, collect_list(video_name)[0] from t_visit_video group by username;

张三  大唐双龙传
李四 天下无贼
…………

video_name is not a grouping column, and the data in this column can still be retrieved.

Excerpts summarized in:

HTTPS://wuwuwu.cn blog on.com/loco有/archive/2006/10/28/542751.HTML

HTTPS://wuwuwu.cn blog on.com/Cao Cao11001100/fear/9043946.HTML

Multi-row merge functions collect_set, collect_list, concat_ws in sql

Leave a Reply

Categories

Tags