
Common Hive tips

Published 2025-02-06, by the 千家信息网 editors

This article walks through a handful of practical Hive tips. They come up often in day-to-day work, so they are shared here for reference; hopefully you pick up something useful.

1. parse_url returns NULL when the requested part is not found

parse_url extracts components from a URL. The most commonly used parts are HOST and QUERY.

string parse_url(string urlString, string partToExtract [, string keyToExtract])

Returns the specified part from the URL. Valid values for partToExtract include HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO. e.g. parse_url('http://facebook.com/path2/p.php?k1=v1&k2=v2#Ref1', 'HOST') returns 'facebook.com'. Also a value of a particular key in QUERY can be extracted by providing the key as the third argument, e.g. parse_url('http://facebook.com/path2/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1') returns 'v1'.

select parse_url('http://www.meilishuo.com/guang/hot','QUERY','page') from class_method_map where parse_url('http://www.meilishuo.com/guang/hot','QUERY','page')<=0 limit 10;

Total MapReduce CPU Time Spent: 2 seconds 440 msec
OK
Time taken: 38.136 seconds

select parse_url('http://www.meilishuo.com/guang/hot','QUERY','page') from class_method_map where parse_url('http://www.meilishuo.com/guang/hot','QUERY','page') is NULL limit 10;
Total MapReduce CPU Time Spent: 2 seconds 660 msec
OK
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
Time taken: 40.133 seconds

select parse_url('http://www.meilishuo.com/guang/hot','QUERY','page') from class_method_map where parse_url('http://www.meilishuo.com/guang/hot','QUERY','page') is not NULL limit 10;
Total MapReduce CPU Time Spent: 2 seconds 530 msec
OK
Time taken: 34.121 seconds
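The NULL-on-missing behavior is easy to picture with Python's standard library. A minimal sketch, assuming only that we want to mimic parse_url(url, 'QUERY', key) — the function name `parse_url_query` is made up for illustration:

```python
from urllib.parse import urlparse, parse_qs

def parse_url_query(url, key):
    """Mimic Hive's parse_url(url, 'QUERY', key): return the value
    for `key` in the query string, or None when it is absent."""
    qs = parse_qs(urlparse(url).query)
    values = qs.get(key)
    return values[0] if values else None

# The example URL has no query string at all, so the lookup yields
# None, just as Hive returns NULL.
print(parse_url_query('http://www.meilishuo.com/guang/hot', 'page'))  # None
print(parse_url_query('http://facebook.com/path2/p.php?k1=v1&k2=v2#Ref1', 'k1'))  # v1
```

Note that, as in SQL generally, a NULL result compares as neither `<=0` nor `>0`, which is why only the `is NULL` predicate above matched any rows.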

2. The default partition can make queries fail. Before dropping it, confirm whether it still exists in the table (show partitions visitlogs). Be especially careful to also remove the underlying data when using external tables.

ALTER TABLE visitlogs DROP PARTITION (dt='2012-12-25', vhour='__HIVE_DEFAULT_PARTITION__');

hive>
> select sessidmodex(sessid,10), count(*), count(distinct sessid), count(distinct visitip) from visitlogs where ((dt='2012-11-10' and vhour>=13) or (dt='2012-11-11' and vhour<13) )
> AND ((class_name='goods' AND method_name='goods_poster' and uri like '%page=0%') OR (class_name='goods' AND method_name='hot' and parse_url(concat('http://www.meilishuo.com',uri),'QUERY','page') is NULL) ) AND not is_spam(dt,sessid,'SESSID') group by sessidmodex(sessid, 10);
FAILED: Hive Internal Error: java.lang.NullPointerException(null)
java.lang.NullPointerException
at org.apache.hadoop.hive.ql.optimizer.pcr.PcrExprProcFactory.opAnd(PcrExprProcFactory.java:128)
at org.apache.hadoop.hive.ql.optimizer.pcr.PcrExprProcFactory$GenericFuncExprProcessor.process(PcrExprProcFactory.java:267)
at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:88)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:125)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:102)
at org.apache.hadoop.hive.ql.optimizer.pcr.PcrExprProcFactory.walkExprTree(PcrExprProcFactory.java:450)
at org.apache.hadoop.hive.ql.optimizer.pcr.PcrOpProcFactory$FilterPCR.process(PcrOpProcFactory.java:149)
at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:88)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:125)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:102)
at org.apache.hadoop.hive.ql.optimizer.pcr.PartitionConditionRemover.transform(PartitionConditionRemover.java:78)
at org.apache.hadoop.hive.ql.optimizer.Optimizer.optimize(Optimizer.java:87)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7306)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:430)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:337)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:889)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

3. Common Array and Map operations:

  • A[n] (A is an array, n is an int): returns the nth element of A; the first element has index 0. E.g. if A is ['foo', 'bar'], then A[0] returns 'foo' and A[1] returns 'bar'.

  • M[key] (M is a map, key has type K): returns the value corresponding to the key in the map. E.g. if M is {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'}, then M['all'] returns 'foobar'.

  • array map_keys(Map<K,V>): returns an unordered array containing the keys of the input map.

  • array map_values(Map<K,V>): returns an unordered array containing the values of the input map.

  • boolean array_contains(Array<T>, value): returns TRUE if the array contains value.

A simple demo: query the records of every user who has arrived via the qzone channel
select * from user_session_stat where dt='2012-10-15' and array_contains(map_keys(market_from),'tx_qzone') limit 10;

visitips format: {"172.0.0.1":100,"172.0.0.2":20,"172.0.0.3":5}  # IP: visit count, sorted by visit count

map_keys(visitips)[0]: fetch the IP with the most visits
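These map operations have direct Python analogues; a quick sketch using the visitips column from a later example. One caveat: the Hive docs describe map_keys as returning an *unordered* array, so `map_keys(visitips)[0]` relies on the map having been written sorted by visit count — the `max` call below is the order-independent version:

```python
# Hive maps behave like Python dicts; map_keys / map_values /
# array_contains translate directly.
visitips = {"123.171.149.104": 21, "182.35.12.232": 9}

keys = list(visitips.keys())              # map_keys(visitips)
values = list(visitips.values())          # map_values(visitips)
contains = "123.171.149.104" in keys      # array_contains(map_keys(visitips), ...)

# Order-independent way to get the busiest IP.
top_ip = max(visitips, key=visitips.get)
print(top_ip)  # 123.171.149.104
```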

4. explode: flatten arrays and maps
select explode(map_keys(market_from)) as cc from user_session_stat where dt='2012-11-28' and size(map_keys(market_from))>1 limit 2;

select tmp.cc,count(*) from (select explode(map_keys(market_from)) as cc from user_session_stat where dt='2012-11-28' and size(map_keys(market_from))>1 limit 10) tmp group by cc;

Details: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView

Interval 1: 9.8~9.18
Interval 2: 10.2~10.12
Interval 3: 10.21~11.01
Interval 4: 11.07~11.17
select tmp2.cc,count(*) from (
select explode(map_keys(uss.market_from)) as cc from (
select sessid from user_session_stat where dt='2012-09-18' and is_spam=0 and sub_channel='norefer') tmp join user_session_stat uss on tmp.sessid=uss.sessid where uss.dt>='2012-09-08' and uss.dt<='2012-09-17' and uss.is_spam=0 ) tmp2 group by tmp2.cc;

PS:
select explode(map_keys(uss.market_from)) as cc,uss.sessid from (
select sessid from user_session_stat where dt='2012-11-01' and is_spam=0 and sub_channel='norefer') tmp join user_session_stat uss on tmp.sessid=uss.sessid where uss.dt>='2012-10-30' and uss.dt<='2012-10-31' and uss.is_spam=0
This is not supported: Error in semantic analysis: 1:51 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'sessid'
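The explode-then-count pattern above is just flatten + group by. A small Python sketch with made-up market_from rows (the data is illustrative, not from the real table):

```python
from collections import Counter

# Toy rows of user_session_stat, reduced to the market_from map column.
rows = [
    {"market_from": {"norefer": 1, "tx_weibo": 1}},
    {"market_from": {"norefer": 1}},
    {"market_from": {"norefer": 1, "tx_qzone": 2}},
]

# explode(map_keys(market_from)) turns each key into its own row;
# the outer "group by cc, count(*)" is then a simple Counter.
exploded = [key for row in rows for key in row["market_from"]]
counts = Counter(exploded)
print(counts["norefer"])  # 3
```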

5. Hive's LATERAL VIEW

http://stackoverflow.com/questions/11373543/explode-the-array-of-struct-in-hive

This is the Hive table:

CREATE EXTERNAL TABLE IF NOT EXISTS SampleTable (
  USER_ID BIGINT,
  NEW_ITEM ARRAY<STRUCT<PRODUCT_ID:BIGINT,TIMESTAMPS:STRING>>
);

And this is the data in the table:

1015826235    [{"product_id":220003038067,"timestamps":"1340321132000"},{"product_id":300003861266,"timestamps":"1340271857000"}]

Is there any way to get the output below from HiveQL after exploding the array?

USER_ID     | PRODUCT_ID   | TIMESTAMPS
------------+--------------+--------------
1015826235  | 220003038067 | 1340321132000
1015826235  | 300003861266 | 1340271857000

The answer:

SELECT
  user_id,
  prod_and_ts.product_id AS product_id,
  prod_and_ts.timestamps AS timestamps
FROM SampleTable
LATERAL VIEW explode(new_item) exploded_table AS prod_and_ts;


select * from user_session_stat LATERAL VIEW explode(map_keys(market_from)) channel_cc as prod_and_ts where dt='2012-12-23' limit 10;

000048693e1e5f7cd12113011144502a 2012-11-30 ["0"] ["","tq_shoppingbag1105d_121223","out_link"] {"123.171.149.104":21,"182.35.12.232":9} /paipaiguang/%E7%BE%BD%E7%BB%92%E6%9C%8D?PTAG=20036.9.213 Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER 8 organic norefer 0 SHANDONG JINAN 3079 {"norefer":1,"tx_weibo":1} 2012-12-23 norefer
000048693e1e5f7cd12113011144502a 2012-11-30 ["0"] ["","tq_shoppingbag1105d_121223","out_link"] {"123.171.149.104":21,"182.35.12.232":9} /paipaiguang/%E7%BE%BD%E7%BB%92%E6%9C%8D?PTAG=20036.9.213 Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER 8 organic norefer 0 SHANDONG JINAN 3079 {"norefer":1,"tx_weibo":1} 2012-12-23 tx_weibo

select sessid,sub_channel,market_from,prod_and_ts from user_session_stat LATERAL VIEW explode(map_keys(market_from)) channel_cc as prod_and_ts where dt='2012-12-23';

00002b681cb0e0ad81212232313359e3 norefer {"norefer":1} norefer
0000480e5b010d80c121223160448bdb norefer {"norefer":1} norefer
000048693e1e5f7cd12113011144502a norefer {"norefer":1,"tx_weibo":1} norefer
000048693e1e5f7cd12113011144502a norefer {"norefer":1,"tx_weibo":1} tx_weibo
00007a56dbbe2cb1e1212230031058e6 norefer {"norefer":1} norefer
00007be99d947ee54121223135026115 norefer {"norefer":1} norefer
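The output above shows the key property of LATERAL VIEW: each exploded element is joined back to its source row, so the other columns repeat once per element. A minimal Python sketch with hypothetical session ids:

```python
# Toy rows: sessid plus a market_from map, as in user_session_stat.
rows = [
    {"sessid": "s1", "market_from": {"norefer": 1, "tx_weibo": 1}},
    {"sessid": "s2", "market_from": {"norefer": 1}},
]

# LATERAL VIEW explode(map_keys(market_from)) channel_cc AS cc:
# one (sessid, cc) pair per map key, source columns repeated.
lateral = [
    (row["sessid"], cc)
    for row in rows
    for cc in row["market_from"]
]
for sessid, cc in lateral:
    print(sessid, cc)  # s1 appears twice, once per channel
```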

6. The Hive equivalent of MySQL's show tables like '%%'
show tables '*cpc*';

7. regexp_extract(uri, '/share/([0-9]+)', 1): extract substrings with a regular expression

The official UDF documentation:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
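For intuition, here is a rough Python counterpart of that regexp_extract call, built on the standard re module (the empty-string fallback on no match is a simplification of this sketch, not a claim about Hive's exact behavior):

```python
import re

def regexp_extract(s, pattern, idx):
    """Rough equivalent of Hive's regexp_extract(s, pattern, idx):
    return the idx-th capture group of the first match, or '' when
    the pattern does not match at all."""
    m = re.search(pattern, s)
    return m.group(idx) if m else ''

print(regexp_extract('/share/12345?from=app', r'/share/([0-9]+)', 1))  # 12345
print(repr(regexp_extract('/guang/hot', r'/share/([0-9]+)', 1)))       # ''
```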

8. Hive's built-in JSON extraction: get_json_object

Example: table A, with id = 110 and a json column:

{"cnt":1,"value":[{"time":1390543558,"index":1,"level":190,"name":"A97M/AccessiV.B virus"},{"time":0,"index":0,"level":0,"name":" "},{"time":0,"index":0,"level":0,"name":" "}]}

To query the name inside the first element of value:

select id, get_json_object(A.json, '$.value[0].name') from A where id=110;
The get_json_object UDF supports a limited version of JSONPath:

  • $ : Root object

  • . : Child operator

  • [] : Subscript operator for array

  • * : Wildcard for []

Syntax not supported that's worth noticing:

  • '' : Zero length string as key

  • .. : Recursive descent

  • @ : Current object/element

  • () : Script expression

  • ?() : Filter (script) expression.

  • [,] : Union operator

  • [start:end:step] : array slice operator

Example: src_json table is a single column (json), single row table:

{"store": {"fruit": [{"weight": 8, "type": "apple"}, {"weight": 9, "type": "pear"}], "bicycle": {"price": 19.95, "color": "red"}}, "email": "amy@only_for_json_udf_test.net", "owner": "amy"}

The fields of the json object can be extracted using these queries:

hive> SELECT get_json_object(src_json.json, '$.owner') FROM src_json;
amy
hive> SELECT get_json_object(src_json.json, '$.store.fruit[0]') FROM src_json;
{"weight":8,"type":"apple"}
hive> SELECT get_json_object(src_json.json, '$.non_exist_key') FROM src_json;
NULL
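The JSONPath walk that get_json_object performs ($ root, . child, [] subscript) maps directly onto dict/list access in Python's json module. A sketch using the virus-log record from the earlier example:

```python
import json

doc = json.loads(
    '{"cnt":1,"value":['
    '{"time":1390543558,"index":1,"level":190,"name":"A97M/AccessiV.B virus"},'
    '{"time":0,"index":0,"level":0,"name":" "}]}'
)

# '$.value[0].name' = root -> child "value" -> subscript [0] -> child "name"
name = doc["value"][0]["name"]
print(name)  # A97M/AccessiV.B virus

# A missing key ('$.non_exist_key') yields NULL in Hive; dict.get gives None.
print(doc.get("non_exist_key"))  # None
```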

That wraps up these common Hive tips. Hopefully the content above helps you learn something new; if you found the article useful, please share it so more people can see it.
