Lesson 57: Spark SQL on Hive Configuration and Practice
1. First, install Hive. For reference, see http://lqding.blog.51cto.com/9123978/1750967
2. Add a configuration file to Spark's conf directory so that Spark can reach Hive's metastore.
root@spark-master:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf# vi hive-site.xml

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://spark-master:9083</value>
    <description>Thrift uri for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>
</configuration>
3. Copy the MySQL JDBC driver into Spark's lib directory.
root@spark-master:/usr/local/hive/apache-hive-1.2.1/lib# cp mysql-connector-java-5.1.36-bin.jar /usr/local/spark/spark-1.6.0-bin-hadoop2.6/lib/
4. Start the Hive metastore service.
root@spark-master:/usr/local/hive/apache-hive-1.2.1/bin# ./hive --service metastore &
[1] 20518
root@spark-master:/usr/local/hive/apache-hive-1.2.1/bin# SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Starting Hive Metastore Server
5. Start spark-shell.
root@spark-master:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/bin# ./spark-shell --master spark://spark-master:7077
Create a HiveContext:
scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc);
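In Spark 1.6, hc.sql returns a DataFrame, so the Hive tables can also be explored through the DataFrame API. A minimal sketch (using the sougou table that the queries below run against):

val df = hc.sql("select * from sougou limit 10")  // DataFrame backed by the Hive table
df.printSchema()  // schema is resolved through the Hive metastore
df.show()         // print the first rows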
Execute SQL:
scala> hc.sql("show tables").collect.foreach(println)[sougou,false][t1,false]scala> hc.sql("select count(*) from sougou").collect.foreach(println)16/03/14 23:15:58 INFO parse.ParseDriver: Parsing command: select count(*) from sougou16/03/14 23:16:00 INFO parse.ParseDriver: Parse Completed16/03/14 23:16:01 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps16/03/14 23:16:02 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 474.9 KB, free 474.9 KB)16/03/14 23:16:02 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 41.6 KB, free 516.4 KB)16/03/14 23:16:02 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.199.100:41635 (size: 41.6 KB, free: 517.4 MB)16/03/14 23:16:02 INFO spark.SparkContext: Created broadcast 0 from collect at:3016/03/14 23:16:03 INFO mapred.FileInputFormat: Total input paths to process : 116/03/14 23:16:03 INFO spark.SparkContext: Starting job: collect at :3016/03/14 23:16:03 INFO scheduler.DAGScheduler: Registering RDD 5 (collect at :30)16/03/14 23:16:03 INFO scheduler.DAGScheduler: Got job 0 (collect at :30) with 1 output partitions16/03/14 23:16:03 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (collect at :30)16/03/14 23:16:03 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)16/03/14 23:16:04 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 0)16/03/14 23:16:04 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[5] at collect at :30), which has no missing parents16/03/14 23:16:04 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 13.8 KB, free 530.2 KB)16/03/14 23:16:04 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 6.9 KB, free 537.1 KB)16/03/14 23:16:04 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.199.100:41635 (size: 6.9 KB, free: 517.4 MB)16/03/14 23:16:04 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:100616/03/14 23:16:04 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[5] at collect at :30)16/03/14 23:16:04 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks16/03/14 23:16:04 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, spark-worker2, partition 0,NODE_LOCAL, 2152 bytes)16/03/14 23:16:04 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, spark-worker1, partition 1,NODE_LOCAL, 2152 bytes)16/03/14 23:16:05 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on spark-worker2:55899 (size: 6.9 KB, free: 146.2 MB)16/03/14 23:16:05 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on spark-worker1:38231 (size: 6.9 KB, free: 146.2 MB)16/03/14 23:16:09 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on spark-worker1:38231 (size: 41.6 KB, free: 146.2 MB)16/03/14 23:16:10 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on spark-worker2:55899 (size: 41.6 KB, free: 146.2 MB)16/03/14 23:16:16 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 12015 ms on spark-worker1 (1/2)16/03/14 23:16:16 INFO scheduler.DAGScheduler: ShuffleMapStage 0 (collect at :30) finished in 12.351 s16/03/14 23:16:16 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 12341 ms on spark-worker2 (2/2)16/03/14 23:16:16 INFO 
scheduler.DAGScheduler: looking for newly runnable stages16/03/14 23:16:16 INFO scheduler.DAGScheduler: running: Set()16/03/14 23:16:16 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 1)16/03/14 23:16:16 INFO scheduler.DAGScheduler: failed: Set()16/03/14 23:16:16 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[8] at collect at :30), which has no missing parents16/03/14 23:16:16 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 16/03/14 23:16:16 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 12.9 KB, free 550.1 KB)16/03/14 23:16:16 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 6.4 KB, free 556.5 KB)16/03/14 23:16:16 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.199.100:41635 (size: 6.4 KB, free: 517.4 MB)16/03/14 23:16:16 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:100616/03/14 23:16:16 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[8] at collect at :30)16/03/14 23:16:16 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks16/03/14 23:16:16 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, spark-worker1, partition 0,NODE_LOCAL, 1999 bytes)16/03/14 23:16:16 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on spark-worker1:38231 (size: 6.4 KB, free: 146.1 MB)16/03/14 23:16:17 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to spark-worker1:4356816/03/14 23:16:17 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 158 bytes16/03/14 23:16:18 INFO scheduler.DAGScheduler: ResultStage 1 (collect at :30) finished in 1.288 s16/03/14 23:16:18 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 1279 ms on spark-worker1 (1/1)16/03/14 23:16:18 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 16/03/14 23:16:18 INFO scheduler.DAGScheduler: Job 0 finished: collect at :30, took 14.285673 s[1000000]
Compared with running the same queries in Hive, this is already faster, and for more complex statements the speedup over Hive is even greater.
scala> hc.sql("select word,count(*) cnt from sougou group by word order by cnt desc limit 5").collect.foreach(println)....16/03/14 23:19:16 INFO scheduler.DAGScheduler: ResultStage 3 (collect at:30) finished in 11.900 s16/03/14 23:19:16 INFO scheduler.DAGScheduler: Job 1 finished: collect at :30, took 17.925094 s16/03/14 23:19:16 INFO scheduler.TaskSetManager: Finished task 195.0 in stage 3.0 (TID 200) in 696 ms on spark-worker2 (200/200)16/03/14 23:19:16 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool [百度,7564][baidu,3652][人体艺术,2786][馆陶县县长闫宁的父亲,2388][4399小游戏,2119]
With Hive, this query previously took close to 110 s; with Spark SQL it finished in under 18 s (17.9 s in the run above).
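The timings above are read off the "Job ... finished ... took" lines in the console log. To time a single query directly, one simple hand-rolled approach is to wrap it in System.nanoTime:

val start = System.nanoTime
hc.sql("select count(*) from sougou").collect()  // collect forces execution
println(f"query took ${(System.nanoTime - start) / 1e9}%.1f s")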