
Hadoop 2.6.5 + Sqoop 1.4.6 Environment Deployment and Testing (Part 2)

Published: 2025-01-23  Author: 千家信息网 editorial staff

First, a recap of the roles the four VMs play in the cluster:

IP            Hostname           Role in the Hadoop cluster
10.0.1.100    hadoop-test-nn     NameNode, ResourceManager
10.0.1.101    hadoop-test-snn    SecondaryNameNode
10.0.1.102    hadoop-test-dn1    DataNode, NodeManager
10.0.1.103    hadoop-test-dn2    DataNode, NodeManager
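These hostnames are assumed to resolve on every node, typically via entries like the following in /etc/hosts (presumably set up in part one of this series):

```
10.0.1.100 hadoop-test-nn
10.0.1.101 hadoop-test-snn
10.0.1.102 hadoop-test-dn1
10.0.1.103 hadoop-test-dn2
```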

1. Extract the downloaded hadoop-2.6.5.tar.gz into /usr/local/ and create a /usr/local/hadoop symlink.

mv hadoop-2.6.5.tar.gz /usr/local/
cd /usr/local
tar -xvf hadoop-2.6.5.tar.gz
ln -s /usr/local/hadoop-2.6.5 /usr/local/hadoop

2. Change the owner and group of /usr/local/hadoop and /usr/local/hadoop-2.6.5 to hadoop, so that the hadoop user can use them:

chown -R hadoop:hadoop /usr/local/hadoop-2.6.5
chown -R hadoop:hadoop /usr/local/hadoop

3. For convenience, set a HADOOP_HOME variable and extend PATH by adding the following lines to /etc/profile:

export HADOOP_HOME=/usr/local/hadoop
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
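A quick sanity check that the PATH additions take effect. Running it in a subshell keeps the current session untouched:

```shell
# Simulate the /etc/profile additions in a subshell, then count how many
# PATH entries now live under /usr/local/hadoop (expect 2: bin and sbin).
(
  export HADOOP_HOME=/usr/local/hadoop
  export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
  echo "$PATH" | tr ':' '\n' | grep -c '^/usr/local/hadoop'
)
```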

4. Hadoop's configuration files live under $HADOOP_HOME/etc/hadoop/. The environment is set up by editing the properties in the files in that directory:
1) Edit the hadoop-env.sh script and set its JAVA_HOME variable:

# In hadoop-env.sh, comment out the original line and add the concrete JDK path
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/local/java/jdk1.7.0_45
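The edit can also be scripted. A sketch demonstrated on a scratch copy of hadoop-env.sh (the real file is $HADOOP_HOME/etc/hadoop/hadoop-env.sh):

```shell
# Work on a scratch copy so nothing real is modified
tmpdir=$(mktemp -d)
printf 'export JAVA_HOME=${JAVA_HOME}\n' > "$tmpdir/hadoop-env.sh"
# Replace the placeholder with the concrete JDK path
sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/local/java/jdk1.7.0_45|' "$tmpdir/hadoop-env.sh"
grep '^export JAVA_HOME' "$tmpdir/hadoop-env.sh"
```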

2) Create the masters file, which specifies the host(s) that run the SecondaryNameNode; add the SecondaryNameNode's hostname to it:

# Add the following line to masters
hadoop-test-snn

3) Create the slaves file, which specifies the hosts that run DataNodes; add the DataNode hostnames to it:

# Add the following lines to slaves
hadoop-test-dn1
hadoop-test-dn2

4) Edit core-site.xml to set the HDFS URL and the HDFS temporary-file directory:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-test-nn:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hadoop/dfs/tmp</value>
  </property>
</configuration>

5) Edit hdfs-site.xml to configure the HDFS-, NameNode-, and DataNode-related properties:

<configuration>
  <property>
    <name>dfs.http.address</name>
    <value>hadoop-test-nn:50070</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hadoop-test-snn:50090</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/hadoop/dfs/data</value>
  </property>
  <property>
    <name>dfs.datanode.ipc.address</name>
    <value>0.0.0.0:50020</value>
  </property>
  <property>
    <name>dfs.datanode.http.address</name>
    <value>0.0.0.0:50075</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>

Property notes:
dfs.http.address: address of the NameNode web UI; listens on port 50070 by default
dfs.namenode.secondary.http-address: address of the SecondaryNameNode web UI; listens on port 50090 by default
dfs.namenode.name.dir: local filesystem path where the NameNode stores its metadata
dfs.datanode.data.dir: local filesystem path where DataNodes store block data
dfs.datanode.ipc.address: address and port of the DataNode's IPC server
dfs.datanode.http.address: address of the DataNode web UI; listens on port 50075 by default
dfs.replication: the number of replicas kept for each block in HDFS

6) Edit mapred-site.xml so that MapReduce runs on the YARN framework:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

7) Since YARN is being used, its properties need configuring as well; make the following changes in yarn-site.xml:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop-test-nn</value>
  </property>
  <property>
    <description>The address of the applications manager interface</description>
    <name>yarn.resourcemanager.address</name>
    <value>${yarn.resourcemanager.hostname}:8040</value>
  </property>
  <property>
    <description>The address of the scheduler interface</description>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>${yarn.resourcemanager.hostname}:8030</value>
  </property>
  <property>
    <description>The http address of the RM web application.</description>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>${yarn.resourcemanager.hostname}:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>${yarn.resourcemanager.hostname}:8025</value>
  </property>
</configuration>

Property notes:
yarn.resourcemanager.hostname: hostname of the node running the ResourceManager
yarn.nodemanager.aux-services: auxiliary services to run on NodeManager nodes; set to mapreduce_shuffle so that MapReduce programs can shuffle map task output to reduce tasks
yarn.resourcemanager.address: the port through which clients submit applications to the ResourceManager; listens on 8032 by default (changed to 8040 in this setup)
yarn.resourcemanager.scheduler.address: the ResourceManager's scheduler interface address; this is also what goes in the Map/Reduce Master field when configuring a MapReduce location in Eclipse; listens on 8030 by default
yarn.resourcemanager.webapp.address: address of the ResourceManager web UI; listens on port 8088 by default
yarn.resourcemanager.resource-tracker.address: the port through which NodeManagers report task status so that the ResourceManager can track tasks; listens on 8031 by default (changed to 8025 in this setup)

There are further properties, such as yarn.resourcemanager.admin.address (the address for administrative commands) and yarn.resourcemanager.resource-tracker.client.thread-count (the number of handler threads for incoming RPC requests); add them to this file if needed.
8) Copy the modified configuration files to each node:

scp core-site.xml hdfs-site.xml mapred-site.xml masters slaves yarn-site.xml hadoop-test-snn:/usr/local/hadoop/etc/hadoop/
scp core-site.xml hdfs-site.xml mapred-site.xml masters slaves yarn-site.xml hadoop-test-dn1:/usr/local/hadoop/etc/hadoop/
scp core-site.xml hdfs-site.xml mapred-site.xml masters slaves yarn-site.xml hadoop-test-dn2:/usr/local/hadoop/etc/hadoop/
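The three copies can also be done in one loop. In this sketch, `echo` previews each command; remove it to actually copy:

```shell
# Distribute the modified config files to every other node.
# echo previews each scp command; drop it to perform the copies.
for host in hadoop-test-snn hadoop-test-dn1 hadoop-test-dn2; do
  echo scp core-site.xml hdfs-site.xml mapred-site.xml masters slaves yarn-site.xml "$host:/usr/local/hadoop/etc/hadoop/"
done
```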

9) Format the NameNode. The first time HDFS is used, the NameNode must be formatted. The directories named by the various *.dir properties in hdfs-site.xml are absolute paths on the local filesystem; as long as the user has full control over their parent directory, they are created automatically when HDFS starts.

So first create the /hadoop directory and change its owner and group to hadoop:

mkdir /hadoop
chown -R hadoop:hadoop /hadoop

Then format the NameNode as the hadoop user:

su - hadoop
$HADOOP_HOME/bin/hdfs namenode -format

Note: watch the log output of this command. If errors or exceptions appear, check the permissions on the directories above first; that is the most likely culprit.

10) Start the cluster services. Once the NameNode has been formatted, the scripts under $HADOOP_HOME/sbin/ start and stop services on the nodes. On the NameNode, start/stop-dfs.sh and start/stop-yarn.sh control HDFS and YARN respectively; start/stop-all.sh controls the services on all nodes at once; and hadoop-daemon.sh starts or stops a specific service on a specific node. Here, start-all.sh starts everything:

start-all.sh

Note: the startup output shows each service being launched and is also saved as *.out log files in a fixed directory; if a service fails to come up, consult its log to troubleshoot.

11) Check the results. After startup, the jps command shows the relevant running processes; since the nodes run different services, the process list differs per node:

NameNode 10.0.1.100:
[hadoop@hadoop-test-nn ~]$ jps
4226 NameNode
4487 ResourceManager
9796 Jps

10.0.1.101 SecondaryNameNode:
[hadoop@hadoop-test-snn ~]$ jps
4890 Jps
31518 SecondaryNameNode

10.0.1.102 DataNode:
[hadoop@hadoop-test-dn1 ~]$ jps
31421 DataNode
2888 Jps
31532 NodeManager

10.0.1.103 DataNode:
[hadoop@hadoop-test-dn2 ~]$ jps
29786 DataNode
29896 NodeManager
1164 Jps
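These per-node checks can be scripted from the NameNode, assuming passwordless SSH to all hosts. In this sketch, `echo` previews each command; drop it to actually run jps remotely:

```shell
# Preview a jps check across all four nodes.
# echo previews each ssh invocation; remove it to really execute.
for host in hadoop-test-nn hadoop-test-snn hadoop-test-dn1 hadoop-test-dn2; do
  echo ssh "$host" jps
done
```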

At this point, the fully distributed Hadoop environment is complete.


12) Run a test job

The bundled MapReduce example wordcount can verify that the Hadoop environment works. It is included in hadoop-mapreduce-examples-2.6.5.jar under $HADOOP_HOME/share/hadoop/mapreduce/, and is invoked as:

hadoop jar hadoop-mapreduce-examples-2.6.5.jar wordcount <input file> [<input file>...] <output directory>
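For intuition about the output this produces (one "word<TAB>count" line per distinct word), a rough local stand-in using standard tools on sample data, not the MapReduce job itself:

```shell
# Count whitespace-separated words, mimicking the shape of wordcount's output
printf 'foo bar foo\nbaz foo\n' > /tmp/wc_demo_input.txt
tr -s ' \t' '\n' < /tmp/wc_demo_input.txt | sort | uniq -c | awk '{print $2 "\t" $1}'
```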

First upload a file into the /test_wordcount directory on HDFS; /etc/profile is used here as the test input:

# Create the /test_wordcount directory on HDFS
[hadoop@hadoop-test-nn mapreduce]$ hdfs dfs -mkdir /test_wordcount
# Upload /etc/profile into /test_wordcount
[hadoop@hadoop-test-nn mapreduce]$ hdfs dfs -put /etc/profile /test_wordcount
[hadoop@hadoop-test-nn mapreduce]$ hdfs dfs -ls /test_wordcount
Found 1 items
-rw-r--r--   2 hadoop supergroup       2064 2017-08-06 21:28 /test_wordcount/profile
# Run the wordcount example against it
[hadoop@hadoop-test-nn mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.6.5.jar wordcount /test_wordcount/profile /test_wordcount_out
17/08/06 21:30:11 INFO client.RMProxy: Connecting to ResourceManager at hadoop-test-nn/10.0.1.100:8040
17/08/06 21:30:13 INFO input.FileInputFormat: Total input paths to process : 1
17/08/06 21:30:13 INFO mapreduce.JobSubmitter: number of splits:1
17/08/06 21:30:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1501950606475_0001
17/08/06 21:30:14 INFO impl.YarnClientImpl: Submitted application application_1501950606475_0001
17/08/06 21:30:14 INFO mapreduce.Job: The url to track the job: http://hadoop-test-nn:8088/proxy/application_1501950606475_0001/
17/08/06 21:30:14 INFO mapreduce.Job: Running job: job_1501950606475_0001
17/08/06 21:30:29 INFO mapreduce.Job: Job job_1501950606475_0001 running in uber mode : false
17/08/06 21:30:29 INFO mapreduce.Job:  map 0% reduce 0%
17/08/06 21:30:39 INFO mapreduce.Job:  map 100% reduce 0%
17/08/06 21:30:49 INFO mapreduce.Job:  map 100% reduce 100%
17/08/06 21:30:50 INFO mapreduce.Job: Job job_1501950606475_0001 completed successfully
17/08/06 21:30:51 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=2320
                FILE: Number of bytes written=219547
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=2178
                HDFS: Number of bytes written=1671
                HDFS: Number of read operations=6
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=7536
                Total time spent by all reduces in occupied slots (ms)=8136
                Total time spent by all map tasks (ms)=7536
                Total time spent by all reduce tasks (ms)=8136
                Total vcore-milliseconds taken by all map tasks=7536
                Total vcore-milliseconds taken by all reduce tasks=8136
                Total megabyte-milliseconds taken by all map tasks=7716864
                Total megabyte-milliseconds taken by all reduce tasks=8331264
        Map-Reduce Framework
                Map input records=84
                Map output records=268
                Map output bytes=2880
                Map output materialized bytes=2320
                Input split bytes=114
                Combine input records=268
                Combine output records=161
                Reduce input groups=161
                Reduce shuffle bytes=2320
                Reduce input records=161
                Reduce output records=161
                Spilled Records=322
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=186
                CPU time spent (ms)=1850
                Physical memory (bytes) snapshot=310579200
                Virtual memory (bytes) snapshot=1682685952
                Total committed heap usage (bytes)=164630528
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=2064
        File Output Format Counters
                Bytes Written=1671

The log shows no errors; check the results under /test_wordcount_out:

[hadoop@hadoop-test-nn mapreduce]$ hdfs dfs -ls /test_wordcount_out
Found 2 items
-rw-r--r--   2 hadoop supergroup          0 2017-08-06 21:30 /test_wordcount_out/_SUCCESS
-rw-r--r--   2 hadoop supergroup       1671 2017-08-06 21:30 /test_wordcount_out/part-r-00000
[hadoop@hadoop-test-nn mapreduce]$ hdfs dfs -cat /test_wordcount_out/part-r-00000
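The part-r-00000 lines are "word<TAB>count". To see the most frequent words first, sort numerically on the second field in descending order; demonstrated here on sample data (on the cluster, pipe `hdfs dfs -cat /test_wordcount_out/part-r-00000` into the same sort):

```shell
# Sample data in the same word<TAB>count shape as part-r-00000
printf 'export\t10\nPATH\t4\nif\t7\n' > /tmp/part_sample.txt
# Sort by count, highest first
sort -k2,2nr /tmp/part_sample.txt
```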

