
An Introduction to Hive Built-in Functions and Query Syntax

Published: 2024-09-21  Author: 千家信息网 editor
1. Hive Built-in Functions:

Hive ships with a large set of built-in functions; see the official documentation for the complete list.

  • The Hive shell itself lets you explore them:
SHOW FUNCTIONS;                              -- list all built-in functions
DESCRIBE FUNCTION <function_name>;           -- show a function's description
DESCRIBE FUNCTION EXTENDED <function_name>;  -- show detailed usage and examples
hive> DESCRIBE FUNCTION case;
OK
CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END - When a = b, returns c; when a = d, return e; else return f
Time taken: 0.006 seconds, Fetched: 1 row(s)
hive> DESCRIBE FUNCTION EXTENDED case;
OK
CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END - When a = b, returns c; when a = d, return e; else return f
Example:
 SELECT
 CASE deptno
   WHEN 1 THEN Engineering
   WHEN 2 THEN Finance
   ELSE admin
 END,
 CASE zone
   WHEN 7 THEN Americas
   ELSE Asia-Pac
 END
 FROM emp_details
Time taken: 0.008 seconds, Fetched: 13 row(s)
# DESCRIBE can be shortened to desc
hive> desc FUNCTION EXTENDED case;
OK
CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END - When a = b, returns c; when a = d, return e; else return f
Example:
 SELECT
 CASE deptno
   WHEN 1 THEN Engineering
   WHEN 2 THEN Finance
   ELSE admin
 END,
 CASE zone
   WHEN 7 THEN Americas
   ELSE Asia-Pac
 END
 FROM emp_details
Time taken: 0.009 seconds, Fetched: 13 row(s)
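The CASE expression above is just an if/elif/else chain. As a sanity check of its semantics, here is a minimal Python sketch of the official `deptno` example (the department names come from the Hive docs, not a real table):

```python
def case_deptno(deptno):
    """Mirrors: CASE deptno WHEN 1 THEN Engineering WHEN 2 THEN Finance ELSE admin END"""
    if deptno == 1:
        return "Engineering"
    elif deptno == 2:
        return "Finance"
    else:
        return "admin"

print(case_deptno(1))  # Engineering
print(case_deptno(7))  # admin
```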

Let's walk through how the most commonly used functions work:

# Create the usual dual table to make testing easier
hive> create table dual(x string);
OK
Time taken: 0.11 seconds
hive> insert into table dual values('');
Query ID = hadoop_20180702100505_f0566585-06b2-4c53-910a-b6a58791fc2d
Total jobs = 3
Launching Job 1 out of 3
...
OK
Time taken: 29.535 seconds
hive> select * from dual;
OK

Time taken: 0.147 seconds, Fetched: 1 row(s)
# Current date
hive> select current_date from dual;
OK
2018-07-02
Time taken: 0.111 seconds, Fetched: 1 row(s)
# Current timestamp
hive> select current_timestamp from dual;
OK
2018-07-02 15:03:28.919
Time taken: 0.117 seconds, Fetched: 1 row(s)
# substr: extract a substring
hive> desc function extended substr;
OK
substr(str, pos[, len]) - returns the substring of str that starts at pos and is of length len or
substr(bin, pos[, len]) - returns the slice of byte array that starts at pos and is of length len
Synonyms: substring
pos is a 1-based index. If pos<0 the starting position is determined by counting backwards from the end of str.
Example:
   > SELECT substr('Facebook', 5) FROM src LIMIT 1;
  'book'
   > SELECT substr('Facebook', -5) FROM src LIMIT 1;
  'ebook'
   > SELECT substr('Facebook', 5, 1) FROM src LIMIT 1;
  'b'
Time taken: 0.016 seconds, Fetched: 10 row(s)
hive> SELECT substr('helloworld',-5) FROM dual;
OK
world
Time taken: 0.171 seconds, Fetched: 1 row(s)
hive> SELECT substr('helloworld',5) FROM dual;
OK
oworld
Time taken: 0.12 seconds, Fetched: 1 row(s)
hive> SELECT substr('helloworld',5,3) FROM dual;
OK
owo
Time taken: 0.142 seconds, Fetched: 1 row(s)
# concat_ws: concatenate strings with a separator
hive> desc function extended concat_ws;
OK
concat_ws(separator, [string | array(string)]+) - returns the concatenation of the strings separated by the separator.
Example:
  > SELECT concat_ws('.', 'www', array('facebook', 'com')) FROM src LIMIT 1;
  'www.facebook.com'
Time taken: 0.019 seconds, Fetched: 4 row(s)
hive> select concat_ws(".","192","168","199","151") from dual;
OK
192.168.199.151
Time taken: 0.152 seconds, Fetched: 1 row(s)
# split: split a string by a regex
hive> desc function extended split;
OK
split(str, regex) - Splits str around occurances that match regex
Example:
  > SELECT split('oneAtwoBthreeC', '[ABC]') FROM src LIMIT 1;
  ["one", "two", "three"]
Time taken: 0.021 seconds, Fetched: 4 row(s)
hive> select split("192.168.199.151","\\.") from dual;
OK
["192","168","199","151"]
Time taken: 0.169 seconds, Fetched: 1 row(s)
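The three string functions map directly onto Python equivalents, which is a handy way to double-check the results above. A sketch (note the 1-based, negative-capable `pos` of Hive's substr, and that split takes a regex, so the dot must be escaped just as Hive needs `"\\."`):

```python
import re

def substr(s, pos, length=None):
    """Approximates Hive substr(): 1-based pos; negative pos counts back from the end."""
    start = pos - 1 if pos > 0 else len(s) + pos
    return s[start:start + length] if length is not None else s[start:]

# Same results as the Hive queries above
print(substr("helloworld", -5))    # world
print(substr("helloworld", 5))     # oworld
print(substr("helloworld", 5, 3))  # owo

# concat_ws is a separator join
print(".".join(["192", "168", "199", "151"]))  # 192.168.199.151

# split is regex-based, so '.' must be escaped
print(re.split(r"\.", "192.168.199.151"))  # ['192', '168', '199', '151']
```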
2. Hive Query Syntax:
  • Basic SELECT syntax:
# Basic SELECT syntax
hive> select * from emp where deptno=10;
OK
7782    CLARK   MANAGER 7839    1981-6-9        2450.0  NULL    10
7839    KING    PRESIDENT       NULL    1981-11-17      5000.0  NULL    10
7934    MILLER  CLERK   7782    1982-1-23       1300.0  NULL    10
Time taken: 0.899 seconds, Fetched: 3 row(s)
hive> select * from emp where empno <= 7800;
OK
7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
7566    JONES   MANAGER 7839    1981-4-2        2975.0  NULL    20
7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981-5-1        2850.0  NULL    30
7782    CLARK   MANAGER 7839    1981-6-9        2450.0  NULL    10
7788    SCOTT   ANALYST 7566    1987-4-19       3000.0  NULL    20
Time taken: 0.277 seconds, Fetched: 8 row(s)
hive> select * from emp where salary between 1000 and 1500;
OK
7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
7844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     30
7876    ADAMS   CLERK   7788    1987-5-23       1100.0  NULL    20
7934    MILLER  CLERK   7782    1982-1-23       1300.0  NULL    10
Time taken: 0.187 seconds, Fetched: 5 row(s)
hive> select * from emp limit 5;
OK
7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
7566    JONES   MANAGER 7839    1981-4-2        2975.0  NULL    20
7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
Time taken: 0.154 seconds, Fetched: 5 row(s)
hive> select * from emp where empno in(7566,7499);
OK
7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
7566    JONES   MANAGER 7839    1981-4-2        2975.0  NULL    20
Time taken: 0.153 seconds, Fetched: 2 row(s)
hive> select * from emp where comm is not null;
OK
7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
7844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     30
Time taken: 0.291 seconds, Fetched: 4 row(s)
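The WHERE predicates above behave like ordinary boolean filters: `between` is inclusive on both ends, `in` tests set membership, and `is not null` keeps rows with a value present (note that TURNER's commission of 0.0 is not NULL). A sketch over a few of the emp rows, with Python `None` standing in for SQL NULL:

```python
# A few (empno, ename, salary, comm) rows from the emp table above
emp = [
    (7369, "SMITH", 800.0, None),
    (7499, "ALLEN", 1600.0, 300.0),
    (7521, "WARD", 1250.0, 500.0),
    (7844, "TURNER", 1500.0, 0.0),
]

# WHERE salary BETWEEN 1000 AND 1500  (inclusive on both ends)
between = [r for r in emp if 1000 <= r[2] <= 1500]
# WHERE empno IN (7566, 7499)
in_list = [r for r in emp if r[0] in (7566, 7499)]
# WHERE comm IS NOT NULL  (0.0 counts as a value, unlike None)
not_null = [r for r in emp if r[3] is not None]

print([r[1] for r in between])   # ['WARD', 'TURNER']
print([r[1] for r in in_list])   # ['ALLEN']
print([r[1] for r in not_null])  # ['ALLEN', 'WARD', 'TURNER']
```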
  • Aggregate and grouping functions:
# Aggregate and grouping functions
# max/min/count/sum/avg take many rows in and produce a single row out
# How many rows are in department 10?
hive> select count(1) from emp where deptno=10;
Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
Total jobs = 1
...
OK
3
Time taken: 38.951 seconds, Fetched: 1 row(s)
# Maximum, minimum, average, and total salary
hive> select max(salary),min(salary),avg(salary),sum(salary) from emp;
Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
Total jobs = 1
...
OK
5000.0  800.0   2073.214285714286       29025.0
Time taken: 23.748 seconds, Fetched: 1 row(s)
# Grouping with group by
# Average salary per department
# Note: any column in the select list that is not inside an aggregate function must appear in group by, otherwise the query fails
hive> select deptno,avg(salary) from emp group by deptno;
Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
Total jobs = 1
...
OK
10      2916.6666666666665
20      2175.0
30      1566.6666666666667
Time taken: 36.502 seconds, Fetched: 3 row(s)
# Highest salary per department (deptno) and job (job)
hive> select deptno,job,max(salary) from emp group by deptno,job;
Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
Total jobs = 1
...
OK
10      CLERK   1300.0
10      MANAGER 2450.0
10      PRESIDENT       5000.0
20      ANALYST 3000.0
20      CLERK   1100.0
20      MANAGER 2975.0
30      CLERK   950.0
30      MANAGER 2850.0
30      SALESMAN        1600.0
Time taken: 36.096 seconds, Fetched: 9 row(s)
# Departments whose average salary exceeds 2000 (filter groups with a having clause)
hive> select deptno,avg(salary) from emp group by deptno having avg(salary) >2000;
Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
Total jobs = 1
...
OK
10      2916.6666666666665
20      2175.0
Time taken: 24.71 seconds, Fetched: 2 row(s)
# case when ... then ... end (does not launch a MapReduce job)
hive> select ename, salary,
    > case
    > when salary > 1 and salary <= 1000 then 'LOWER'
    > when salary > 1000 and salary <= 2000 then 'MIDDLE'
    > when salary > 2000 and salary <= 4000 then 'HIGH'
    > ELSE 'HIGHEST'
    > end
    > from emp;
OK
SMITH   800.0   LOWER
ALLEN   1600.0  MIDDLE
WARD    1250.0  MIDDLE
JONES   2975.0  HIGH
MARTIN  1250.0  MIDDLE
BLAKE   2850.0  HIGH
CLARK   2450.0  HIGH
SCOTT   3000.0  HIGH
KING    5000.0  HIGHEST
TURNER  1500.0  MIDDLE
ADAMS   1100.0  MIDDLE
JAMES   950.0   LOWER
FORD    3000.0  HIGH
MILLER  1300.0  MIDDLE
Time taken: 0.096 seconds, Fetched: 14 row(s)
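Conceptually, GROUP BY partitions the rows by key and runs the aggregate over each partition, and HAVING then filters the aggregated groups. A sketch of avg-per-group plus a HAVING filter, using the three department-10 salaries from above and two made-up department-20 salaries for illustration:

```python
from collections import defaultdict

# (deptno, salary) pairs: dept 10 rows from the transcript, dept 20 values illustrative
rows = [(10, 2450.0), (10, 5000.0), (10, 1300.0), (20, 800.0), (20, 2975.0)]

# GROUP BY deptno: accumulate sum and count per key
acc = defaultdict(lambda: [0.0, 0])
for deptno, salary in rows:
    acc[deptno][0] += salary
    acc[deptno][1] += 1

# AVG(salary) per group, then HAVING avg(salary) > 2000
avgs = {d: total / n for d, (total, n) in acc.items()}
high = {d: a for d, a in avgs.items() if a > 2000}

print(avgs)  # {10: 2916.6666666666665, 20: 1887.5}
print(high)  # {10: 2916.6666666666665}
```

Note that the department-10 average matches the Hive output (2916.6666666666665) exactly.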
  • Multi-table JOIN queries:
# Create test tables
hive> create table a(
    > id int, name string
    > ) row format delimited fields terminated by '\t';
OK
Time taken: 0.311 seconds
hive> create table b(
    > id int, age int
    > ) row format delimited fields terminated by '\t';
OK
Time taken: 0.142 seconds
# After inserting or loading data, the tables look like this:
hive> select * from a;
OK
1       zhangsan
2       lisi
3       wangwu
hive> select * from b;
OK
1       20
2       30
4       40
Time taken: 0.2 seconds, Fetched: 3 row(s)
# Inner join (inner join = join): returns only the rows that satisfy the join condition in both tables
hive> select a.id,a.name,b.age from a join b on a.id=b.id;
Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
Total jobs = 1
...
OK
1       zhangsan        20
2       lisi    30
Time taken: 24.415 seconds, Fetched: 2 row(s)
# Left outer join (left join): every row from the left table appears; unmatched columns are filled with NULL
hive> select a.id,a.name,b.age from a left join b on a.id=b.id;
Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
Total jobs = 1
...
OK
1       zhangsan        20
2       lisi    30
3       wangwu  NULL
Time taken: 26.218 seconds, Fetched: 3 row(s)
# Right outer join (right join): the right table is the base
hive> select a.id,a.name,b.age from a right join b on a.id=b.id;
Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
Total jobs = 1
...
OK
1       zhangsan        20
2       lisi    30
NULL    NULL    40
Time taken: 24.027 seconds, Fetched: 3 row(s)
# Full join: rows from both tables appear; where there is no match, NULL is shown
hive> select a.id,a.name,b.age from a full join b on a.id=b.id;
Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
Total jobs = 1
...
OK
1       zhangsan        20
2       lisi    30
3       wangwu  NULL
NULL    NULL    40
Time taken: 32.94 seconds, Fetched: 4 row(s)
# Cartesian product (cross join): no join condition; every row of table 1 is paired with every row of table 2
hive> select a.id,a.name,b.age from a cross join b;
Warning: Map Join MAPJOIN[7][bigTable=a] in task 'Stage-3:MAPRED' is a cross product
Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
Total jobs = 1
...
OK
1       zhangsan        20
1       zhangsan        30
1       zhangsan        40
2       lisi    20
2       lisi    30
2       lisi    40
3       wangwu  20
3       wangwu  30
3       wangwu  40
Time taken: 29.825 seconds, Fetched: 9 row(s)
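The five join flavors can be sketched in a few lines of Python over the same a/b data, with `None` standing in for NULL. This is just a semantic model, not how Hive executes joins (it compiles them to MapReduce stages):

```python
a = [(1, "zhangsan"), (2, "lisi"), (3, "wangwu")]
b = [(1, 20), (2, 30), (4, 40)]
a_map, b_map = dict(a), dict(b)

# join: only ids present in both
inner = [(i, n, b_map[i]) for i, n in a if i in b_map]
# left join: all of a; missing ages become None
left = [(i, n, b_map.get(i)) for i, n in a]
# right join: all of b; missing names become None
right = [(i, a_map.get(i), age) for i, age in b]
# full join: left rows plus b rows that matched nothing in a
full = left + [(i, None, age) for i, age in b if i not in a_map]
# cross join: every pair
cross = [(i, n, age) for i, n in a for _, age in b]

print(inner)       # [(1, 'zhangsan', 20), (2, 'lisi', 30)]
print(left[-1])    # (3, 'wangwu', None)
print(right[-1])   # (4, None, 40)
print(len(cross))  # 9
```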
3. Word Count with Hive SQL:
# Create the table and load test data
hive> create table hive_wc(sentence string);
OK
Time taken: 0.149 seconds
[hadoop@hadoop000 ~]$ cat hive-wc.txt
hello,world,welcome
hello,welcome
hive> load data local inpath '/home/hadoop/hive-wc.txt' into table hive_wc;
Loading data to table default.hive_wc
Table default.hive_wc stats: [numFiles=1, totalSize=34]
OK
Time taken: 0.729 seconds
hive> select * from hive_wc;
OK
hello,world,welcome
hello,welcome
Time taken: 0.13 seconds, Fetched: 2 row(s)
# Split each sentence into words with split
hive> select split(sentence,",") from hive_wc;
OK
["hello","world","welcome"]
["hello","welcome"]
Time taken: 0.163 seconds, Fetched: 2 row(s)
# explode turns an array into multiple rows; combined with split:
hive> select explode(split(sentence,",")) from hive_wc;
OK
hello
world
welcome
hello
welcome
Time taken: 0.068 seconds, Fetched: 5 row(s)
# group by the exploded words: word count in a single statement
hive> select word, count(1) as c
    > from (select explode(split(sentence,",")) as word from hive_wc) t
    > group by word;
Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
Total jobs = 1
...
OK
hello   2
welcome 2
world   1
Time taken: 34.168 seconds, Fetched: 3 row(s)
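The split / explode / group-by pipeline is exactly split / flatten / count. A Python sketch over the same two input lines, where the flattening comprehension plays the role of explode and Counter plays the role of group by with count(1):

```python
from collections import Counter

lines = ["hello,world,welcome", "hello,welcome"]

# split each line, then "explode" (flatten) the arrays into one stream of words
words = [w for line in lines for w in line.split(",")]

# "group by word" with count(1)
counts = Counter(words)
print(sorted(counts.items()))
# [('hello', 2), ('welcome', 2), ('world', 1)]
```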