Lesson 55: Hands-On Hive Analysis of Search Engine Data
1. Getting the Data
Sogou Labs provides logs of user queries made through the Sogou search engine. The download address is
http://download.labs.sogou.com/dl/q.html
This article uses the lite version.
The data format is as follows:
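(The original screenshot of the sample data is not reproduced here; the format can be read off the table schema defined below and the rows queried in section 5.2.) Each line is one query event with tab-separated fields, for example this record taken from the query output later in the article:

20111230000005  57375476989eea12893c0c3811607bcf  奇艺高清  1  1  http://www.qiyi.com/

The fields are: query time, user session ID, query word, rank of the clicked URL in the result list, sequence number of the user's click, and the clicked URL.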
2. Uploading the Data to HDFS
Create an HDFS directory:
root@spark-master:~# hdfs dfs -mkdir -p /library/sougou
Upload the files:
root@spark-master:~# hdfs dfs -put SogouQ1.txt /library/sougou
root@spark-master:~# hdfs dfs -put SogouQ2.txt /library/sougou
root@spark-master:~# hdfs dfs -put SogouQ3.txt /library/sougou
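A quick listing confirms the upload (a verification step, not part of the original transcript):

root@spark-master:~# hdfs dfs -ls /library/sougou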
3. Creating a Table in Hive
root@spark-master:/usr/local/hive/apache-hive-1.2.1/bin# ./hive
Logging initialized using configuration in jar:file:/usr/local/hive/apache-hive-1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> CREATE TABLE SOUGOU(ID STRING, WEBSESSION STRING, WORD STRING, S_SEQ INT, C_SEQ INT, WEBSITE STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
OK
Time taken: 1.995 seconds
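You can double-check the schema with DESCRIBE; the column names and types simply echo the DDL above:

hive> DESCRIBE sougou;
OK
id          string
websession  string
word        string
s_seq       int
c_seq       int
website     string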
4. Loading the Data
hive> LOAD DATA INPATH '/library/sougou/SogouQ1.txt' INTO TABLE sougou;
Loading data to table default.sougou
Table default.sougou stats: [numFiles=1, totalSize=108750574]
OK
Time taken: 1.101 seconds
Now, if we look at the source directory again, SogouQ1.txt is gone. Where did the file go?
As it turns out, loading data is simply a move of the file to a new location within HDFS, which is why it is so fast.
So could we place SogouQ1.txt directly into /user/hive/warehouse/sougou/ on HDFS instead of using a LOAD statement?
LOAD is the recommended route, because it lets the metastore record which data files belong to the table (notice the table statistics printed above). In practice, though, Hive reads whatever files sit in the table's directory at query time, so a file put there directly is also picked up, as we will see below when we fix the encoding.
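You can watch the move happen with two listings (illustrative commands; the paths are the ones used above):

root@spark-master:~# hdfs dfs -ls /library/sougou                # SogouQ1.txt is no longer here
root@spark-master:~# hdfs dfs -ls /user/hive/warehouse/sougou/   # it now lives in the table's directory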
5. Working with the Data
5.1 Counting rows
hive> select count(*) from sougou;
Query ID = root_20160314192407_792e772a-c225-4b37-b948-7050d6b529b4
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457942575478_0002, Tracking URL = http://spark-master:8088/proxy/application_1457942575478_0002/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457942575478_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-03-14 19:24:29,014 Stage-1 map = 0%,  reduce = 0%
2016-03-14 19:24:47,137 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.14 sec
2016-03-14 19:25:04,563 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 9.38 sec
MapReduce Total cumulative CPU time: 9 seconds 380 msec
Ended Job = job_1457942575478_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 9.38 sec   HDFS Read: 108757501 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 380 msec
OK
1000000
Time taken: 58.603 seconds, Fetched: 1 row(s)
5.2 Viewing the data
hive> select * from sougou limit 5;
OK
20111230000005  57375476989eea12893c0c3811607bcf    1  1  http://www.qiyi.com/
20111230000005  66c5bb7774e31d0a22278249b26bc83a    3  1  http://www.booksky.org/BookDetail.aspx?BookID=1050804&Level=1
20111230000007  b97920521c78de70ac38e3713f524b50    1  1  http://www.bblianmeng.com/
20111230000008  6961d0c97fe93701fc9c0d861d096cd9    1  1  http://lib.scnu.edu.cn/
20111230000008  f2f5a21c764aebde1e8afcc2871e086f    2  1  http://proxyie.cn/
Time taken: 0.246 seconds, Fetched: 5 row(s)
The query-word column comes out garbled here (the damaged bytes do not even survive in the capture above). The reason is that the source file is GB2312-encoded, while Hadoop and Hive both use UTF-8. Let's convert the file's encoding and upload it to HDFS again:
root@spark-master:~# iconv -t utf-8 -f gb2312 -c SogouQ1.txt > SogouQ1.txt.utf8
root@spark-master:~# rm SogouQ1.txt ; mv SogouQ1.txt.utf8 SogouQ1.txt
root@spark-master:~# hdfs dfs -rm /user/hive/warehouse/sougou/SogouQ1.txt
16/03/14 19:44:25 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/hive/warehouse/sougou/SogouQ1.txt
root@spark-master:~# hdfs dfs -put SogouQ1.txt /user/hive/warehouse/sougou/
Check again:
hive> select * from sougou limit 5;
OK
20111230000005  57375476989eea12893c0c3811607bcf  奇艺高清  1  1  http://www.qiyi.com/
20111230000005  66c5bb7774e31d0a22278249b26bc83a  凡人修仙传  3  1  http://www.booksky.org/BookDetail.aspx?BookID=1050804&Level=1
20111230000007  b97920521c78de70ac38e3713f524b50  本本联盟  1  1  http://www.bblianmeng.com/
20111230000008  6961d0c97fe93701fc9c0d861d096cd9  华南师范大学图书馆  1  1  http://lib.scnu.edu.cn/
20111230000008  f2f5a21c764aebde1e8afcc2871e086f  在线代理  2  1  http://proxyie.cn/
Time taken: 0.151 seconds, Fetched: 5 row(s)
Now the rows display correctly.
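As an aside, converting with iconv is not the only option: since Hive 0.14, LazySimpleSerDe (the default SerDe for this kind of delimited text table) can decode non-UTF-8 files at read time through the serialization.encoding property. A sketch of that alternative, assuming the table still uses the default text SerDe:

hive> ALTER TABLE sougou SET SERDEPROPERTIES ('serialization.encoding'='GBK');

After this, Hive transcodes the GBK/GB2312 bytes to UTF-8 on the fly, so the files themselves can stay in their original encoding.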
5.3 A more complex query
hive> select count(*) from sougou where s_seq=1 and c_seq=1 and website like '%baidu%';
Query ID = root_20160314194855_8c9aa844-e088-4695-942f-3579718962f6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457942575478_0003, Tracking URL = http://spark-master:8088/proxy/application_1457942575478_0003/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457942575478_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-03-14 19:49:12,041 Stage-1 map = 0%,  reduce = 0%
2016-03-14 19:49:33,174 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 7.94 sec
2016-03-14 19:49:48,672 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 11.55 sec
MapReduce Total cumulative CPU time: 11 seconds 550 msec
Ended Job = job_1457942575478_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 11.55 sec   HDFS Read: 114834192 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 550 msec
OK
59090
Time taken: 55.022 seconds, Fetched: 1 row(s)
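Along the same lines, you could rank the most-clicked URLs among these "first result, first click" events (an illustrative query, not from the original article):

hive> select website, count(*) cnt from sougou where s_seq=1 and c_seq=1 group by website order by cnt desc limit 5;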
Ranking the query words by frequency:
hive> select word, count(*) cnt from sougou group by word order by cnt desc limit 5;
Query ID = root_20160314202108_58aeca03-8ed6-4626-b15e-af6643c94107
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457942575478_0007, Tracking URL = http://spark-master:8088/proxy/application_1457942575478_0007/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457942575478_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-03-14 20:21:29,040 Stage-1 map = 0%,  reduce = 0%
2016-03-14 20:21:57,425 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 14.98 sec
2016-03-14 20:22:16,021 Stage-1 map = 100%,  reduce = 68%, Cumulative CPU 20.27 sec
2016-03-14 20:22:19,268 Stage-1 map = 100%,  reduce = 77%, Cumulative CPU 23.16 sec
2016-03-14 20:22:22,593 Stage-1 map = 100%,  reduce = 93%, Cumulative CPU 25.9 sec
2016-03-14 20:22:23,721 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 26.9 sec
MapReduce Total cumulative CPU time: 26 seconds 900 msec
Ended Job = job_1457942575478_0007
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457942575478_0008, Tracking URL = http://spark-master:8088/proxy/application_1457942575478_0008/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457942575478_0008
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2016-03-14 20:22:44,377 Stage-2 map = 0%,  reduce = 0%
2016-03-14 20:23:07,303 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 9.95 sec
2016-03-14 20:23:25,482 Stage-2 map = 100%,  reduce = 82%, Cumulative CPU 15.54 sec
2016-03-14 20:23:26,563 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 16.88 sec
MapReduce Total cumulative CPU time: 16 seconds 880 msec
Ended Job = job_1457942575478_0008
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 26.9 sec   HDFS Read: 114832713 HDFS Write: 15044297 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 16.88 sec   HDFS Read: 15048892 HDFS Write: 153 SUCCESS
Total MapReduce CPU Time Spent: 43 seconds 780 msec
OK
百度  7564
baidu  3652
人体艺术  2786
馆陶县县长闫宁的父亲  2388
4399小游戏  2119
Time taken: 140.18 seconds, Fetched: 5 row(s)
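If you want to keep a result like this, Hive can write query output to a local directory; a minimal sketch, where the output path /tmp/top_words is arbitrary:

hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/top_words'
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > select word, count(*) cnt from sougou group by word order by cnt desc limit 5;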
6. External Tables
The table we created in step 3 is a managed (internal) table. When a managed table is created, a directory with the same name as the table is created under /user/hive/warehouse; when data is loaded, the source file is moved into that directory; and when the table is dropped, the directory and its files are deleted along with it:
hive> drop table sougou;
OK
Time taken: 0.983 seconds
Check HDFS:
root@spark-master:~# hdfs dfs -ls /user/hive/warehouse/
Found 1 items
drwxr-xr-x   - root supergroup          0 2016-03-14 17:10 /user/hive/warehouse/t1
Hive also offers another kind of table, called an external table. It is created like this:
hive> CREATE EXTERNAL TABLE SOUGOU(ID STRING, WEBSESSION STRING, WORD STRING, S_SEQ INT, C_SEQ INT, WEBSITE STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE LOCATION '/library/sougou/sougouExternal';
OK
Time taken: 0.123 seconds
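To confirm that Hive really registered it as an external table at that location, you can inspect the table metadata and look for "Table Type: EXTERNAL_TABLE" and the Location field in the output:

hive> DESC FORMATTED sougou;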
root@spark-master:~# hdfs dfs -ls /user/hive/warehouse/
16/03/14 20:02:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwxr-xr-x   - root supergroup          0 2016-03-14 17:10 /user/hive/warehouse/t1
root@spark-master:~# hdfs dfs -ls /library/sougou/
16/03/14 20:02:29 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r--   3 root supergroup  217441417 2016-03-14 18:55 /library/sougou/SogouQ2.txt
-rw-r--r--   3 root supergroup 1086552775 2016-03-14 18:56 /library/sougou/SogouQ3.txt
drwxr-xr-x   - root supergroup          0 2016-03-14 20:01 /library/sougou/sougouExternal
The directory is created directly at the specified location.
Upload a file:
root@spark-master:~# hdfs dfs -put SogouQ1.txt /library/sougou/sougouExternal
Query the data in Hive:
hive> select count(*) from sougou;
Query ID = root_20160314200414_b514251b-58d3-40aa-a9ee-4a9cf5eef8f2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457942575478_0004, Tracking URL = http://spark-master:8088/proxy/application_1457942575478_0004/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457942575478_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-03-14 20:04:27,514 Stage-1 map = 0%,  reduce = 0%
2016-03-14 20:04:41,458 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.66 sec
2016-03-14 20:04:52,341 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.26 sec
MapReduce Total cumulative CPU time: 7 seconds 260 msec
Ended Job = job_1457942575478_0004
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.26 sec   HDFS Read: 114832746 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 260 msec
OK
1000000
Time taken: 39.823 seconds, Fetched: 1 row(s)
When an external table is dropped, the files on HDFS are not deleted:
hive> drop table sougou;
OK
Time taken: 0.363 seconds
root@spark-master:~# hdfs dfs -ls /library/sougou/sougouExternal/
16/03/14 20:16:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   3 root supergroup  114825752 2016-03-14 20:03 /library/sougou/sougouExternal/SogouQ1.txt
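Because the data survives the drop, re-running the CREATE EXTERNAL TABLE statement from above makes the existing files queryable again immediately; a quick sanity check, assuming the same DDL:

hive> CREATE EXTERNAL TABLE SOUGOU(ID STRING, WEBSESSION STRING, WORD STRING, S_SEQ INT, C_SEQ INT, WEBSITE STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE LOCATION '/library/sougou/sougouExternal';
hive> select count(*) from sougou;   -- should again return 1000000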