How to Implement Data Compression on HDFS
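Our company runs a Hadoop cluster of fewer than 30 nodes, with 120 TB of total HDFS capacity. Monitoring has recently kept raising low-disk alerts (an alert fires when free space drops below 5%). I had been too busy with business work to tidy up the cluster. Once I did, I found that the existing files add up to about 34 TB, which with 3x replication puts total HDFS usage at 103 TB. The data had been written as plain text during cleansing, with no compression at all, so there should be plenty of room for optimization. One log dataset that records the apps installed on users' phones takes up about 5 TB, so I started with that one.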
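Hive has three file storage formats: TEXTFILE, SEQUENCEFILE, and RCFILE. The first two are row-oriented, while RCFile is a column-oriented format introduced by Hive. It follows a "partition horizontally first, then vertically" design: rows are first split into row groups, and within each row group the data is stored column by column. When a query does not need certain columns, RCFile skips them at the I/O level. For that reason I chose RCFILE, combined with Gzip compression.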
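To make the column-skipping point concrete, here is a minimal sketch of a query that would benefit from it. The table name `app_install_rc`, the column `app_name`, and the partition value are hypothetical and used only for illustration. They are not taken from the real table.

```sql
-- Hypothetical table and column names, for illustration only.
-- Only app_name (plus the partition key day) is referenced, so with an
-- RCFile-backed table the reader can skip the data of all other columns.
SELECT app_name, COUNT(*) AS installs
FROM app_install_rc
WHERE day = 20130101
GROUP BY app_name;
```

With a TEXTFILE table the same query has to read every full row, so the wider the table, the more a columnar layout should help.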
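Along the way I made a rather silly mistake. A colleague (who has since left) had looked into RCFile earlier, so I ran show create table XX on his table to see the DDL, which came back as: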
CREATE EXTERNAL TABLE XX( ...... )
PARTITIONED BY ( day int)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
LOCATION '/user/hive/data/XX';
I copied it as-is, changed the columns, and created an RCFile table called app_install. Then I loaded the existing data with SQL:
set mapred.job.priority=VERY_HIGH;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=200000000;
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
set mapred.job.name=app_install.$_DAY;
insert overwrite table app_install1 PARTITION (day=$_DAY)
select XXX from tb1 where day=$_DAY
The job failed. The Hadoop task logs showed:
FATAL ExecReducer: java.lang.UnsupportedOperationException: Currently the writer can only accept BytesRefArrayWritable
    at org.apache.hadoop.hive.ql.io.RCFile$Writer.append(RCFile.java:880)
    at org.apache.hadoop.hive.ql.io.RCFileOutputFormat$2.write(RCFileOutputFormat.java:140)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:588)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
    at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:389)
    at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:715)
    at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:697)
    at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:697)
    at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:856)
    at org.apache.hadoop.hive.ql.exec.JoinOperator.endGroup(JoinOperator.java:265)
    at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:198)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
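Posts online described this as a Hive bug, and I kept assuming that was the cause. After struggling with it for a day, I finally tried changing the CREATE TABLE statement the way those posts suggested: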
CREATE EXTERNAL TABLE XX( ...... )
PARTITIONED BY ( day int)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS RCFILE
LOCATION '/user/hive/data/XX';
This time the job ran normally. But when I checked the table with show create table XX, the storage clause had once again turned into:
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
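Very frustrating. The whole problem came down to show create table displaying something different from the statement that actually created the table. It is a small issue, but it cost a lot of time and energy.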
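A likely explanation, which the troubleshooting above did not confirm: STORED AS RCFILE also sets the table's SerDe to a columnar one (ColumnarSerDe), whereas spelling out only the INPUTFORMAT and OUTPUTFORMAT classes leaves the default text SerDe in place. The RCFile writer then receives rows that are not BytesRefArrayWritable, which matches the exception in the log. The sketch below declares the SerDe explicitly and then inspects it. The table name, columns, and location are hypothetical.

```sql
-- Hypothetical table name, columns, and location, for illustration only.
CREATE EXTERNAL TABLE app_install_rc (
  device_id STRING,
  app_name  STRING
)
PARTITIONED BY (day INT)
-- Declare the columnar SerDe explicitly instead of relying on STORED AS RCFILE.
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS
  INPUTFORMAT  'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
LOCATION '/user/hive/data/app_install_rc';

-- Inspect the table; the "SerDe Library" row shows which SerDe is in effect.
DESCRIBE FORMATTED app_install_rc;
```

Running DESCRIBE FORMATTED against both the failing table and the working one should show whether the SerDe is the real difference between them, since the show create table output in this case did not reveal it.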