
Running Hadoop's Built-in wordcount Word-Count Program


0. Preface


The previous post, 《Hadoop初体验:快速搭建Hadoop伪分布式环境》 (Hadoop First Steps: Quickly Setting Up a Hadoop Pseudo-Distributed Environment), walked through setting up a Hadoop environment. This post uses Hadoop's bundled wordcount program as a word-count example.




1. Word Counting with the Bundled Example Program


(1) The wordcount program

The wordcount program ships under Hadoop's share directory:

[root@leaf mapreduce]# pwd
/usr/local/hadoop/share/hadoop/mapreduce
[root@leaf mapreduce]# ls
hadoop-mapreduce-client-app-2.6.5.jar         hadoop-mapreduce-client-jobclient-2.6.5-tests.jar
hadoop-mapreduce-client-common-2.6.5.jar      hadoop-mapreduce-client-shuffle-2.6.5.jar
hadoop-mapreduce-client-core-2.6.5.jar        hadoop-mapreduce-examples-2.6.5.jar
hadoop-mapreduce-client-hs-2.6.5.jar          lib
hadoop-mapreduce-client-hs-plugins-2.6.5.jar  lib-examples
hadoop-mapreduce-client-jobclient-2.6.5.jar   sources

The jar we want is hadoop-mapreduce-examples-2.6.5.jar.
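As an aside, if you are not sure which example programs a given jar bundles, running it with no program name should make Hadoop print the list of valid program names, wordcount among them. A quick check might look like this:

[root@leaf ~]# hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.5.jar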

(2) Create the HDFS directories

Create a directory to hold the MapReduce job's input files (-p also creates any missing parent directories, here /data):

[root@leaf ~]# hadoop fs -mkdir -p /data/wordcount

Create a directory to hold the MapReduce job's output:

[root@leaf ~]# hadoop fs -mkdir /output
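One caveat: a MapReduce job refuses to start if its final output directory already exists, which is why only /output is created here and /output/wordcount itself is left for the job to create. If you re-run the job later, remove the stale results first:

[root@leaf ~]# hadoop fs -rm -r /output/wordcount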

List the two directories just created:

[root@leaf ~]# hadoop fs -ls /
drwxr-xr-x   - root supergroup          0 2017-09-01 20:34 /data
drwxr-xr-x   - root supergroup          0 2017-09-01 20:35 /output


(3) Create a word file and upload it to HDFS

The word file looks like this:

[root@leaf ~]# cat myword.txt 
leaf yyh
yyh xpleaf
katy ling
yeyonghao leaf
xpleaf katy

Upload the file to HDFS:

[root@leaf ~]# hadoop fs -put myword.txt /data/wordcount
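Since wordcount takes a directory as input and processes every file in it, more sample files could be uploaded alongside myword.txt before running the job; -put accepts multiple local sources when the destination is a directory (the extra file names below are hypothetical):

[root@leaf ~]# hadoop fs -put myword2.txt myword3.txt /data/wordcount   # hypothetical extra inputs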

Verify the uploaded file and its contents in HDFS:

[root@leaf ~]# hadoop fs -ls /data/wordcount
-rw-r--r--   1 root supergroup         57 2017-09-01 20:40 /data/wordcount/myword.txt
[root@leaf ~]# hadoop fs -cat /data/wordcount/myword.txt
leaf yyh
yyh xpleaf
katy ling
yeyonghao leaf
xpleaf katy


(4) Run the wordcount program

Run the following command:

[root@leaf ~]# hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.5.jar wordcount /data/wordcount /output/wordcount
...
17/09/01 20:48:14 INFO mapreduce.Job: Job job_local1719603087_0001 completed successfully
17/09/01 20:48:14 INFO mapreduce.Job: Counters: 38
        File System Counters
                FILE: Number of bytes read=585940
                FILE: Number of bytes written=1099502
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=114
                HDFS: Number of bytes written=48
                HDFS: Number of read operations=15
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=4
        Map-Reduce Framework
                Map input records=5
                Map output records=10
                Map output bytes=97
                Map output materialized bytes=78
                Input split bytes=112
                Combine input records=10
                Combine output records=6
                Reduce input groups=6
                Reduce shuffle bytes=78
                Reduce input records=6
                Reduce output records=6
                Spilled Records=12
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=92
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0
                Total committed heap usage (bytes)=241049600
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters 
                Bytes Read=57
        File Output Format Counters 
                Bytes Written=48
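The counters line up with the input: Map input records=5 matches the five lines of myword.txt, Map output records=10 its ten words, and Reduce output records=6 the six distinct words. (The job_local prefix in the job ID also shows this run went through the local job runner rather than YARN, which is normal for a minimal pseudo-distributed setup.) Before reading the results, you can list the output directory; a successful run leaves an empty _SUCCESS marker next to one part-r-NNNNN file per reducer, a single part-r-00000 here:

[root@leaf ~]# hadoop fs -ls /output/wordcount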


(5) View the results

The word counts:

[root@leaf ~]# hadoop fs -cat /output/wordcount/part-r-00000
katy    2
leaf    2
ling    1
xpleaf  2
yeyonghao       1
yyh     2
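As a sanity check, the same counts can be reproduced locally with standard Unix tools (a quick sketch, assuming myword.txt is still in the current directory):

[root@leaf ~]# tr -s ' ' '\n' < myword.txt | sort | uniq -c

tr splits each line into one word per line, sort groups duplicates together, and uniq -c counts each group; the output should show the same six counts as part-r-00000 above.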




2. References


http://www.aboutyun.com/thread-7713-1-1.html
