
How to Handle UNHEALTHY Nodes in YARN

Published 2025-02-02 by the 千家信息网 editors

This post walks through how to diagnose and fix UNHEALTHY nodes in YARN. I hope you get something useful out of it; let's dig in.

1. The Error

The cluster consists of three virtual machines: hadoop001, hadoop002 and hadoop003.

Checking the ResourceManager web UI on port 23188 showed Unhealthy Nodes, and the number of active nodes was wrong.

Listing all the nodes confirmed this:

$ yarn node -list -all
Total Nodes:4
         Node-Id             Node-State  Node-Http-Address  Number-of-Running-Containers
 hadoop001:34354              UNHEALTHY    hadoop001:23999                             0
 hadoop002:60027                RUNNING    hadoop002:23999                             0
 hadoop001:50623              UNHEALTHY    hadoop001:23999                             0
 hadoop003:39700              UNHEALTHY    hadoop003:23999                             0
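On a larger cluster the listing gets long, so it can help to filter it down to just the unhealthy nodes. The sketch below assumes the default column layout of `yarn node -list -all` (Node-Id first, Node-State second); it is demonstrated against a captured sample so it runs standalone, but in practice you would pipe the live command output into the same filter.

```shell
# Print the Node-Id of every node whose Node-State column is UNHEALTHY.
# Live usage:  yarn node -list -all | list_unhealthy
list_unhealthy() {
  awk '$2 == "UNHEALTHY" { print $1 }'
}

# Demonstrated on a captured sample of the listing above:
list_unhealthy <<'EOF'
hadoop001:34354 UNHEALTHY hadoop001:23999 0
hadoop002:60027 RUNNING   hadoop002:23999 0
hadoop003:39700 UNHEALTHY hadoop003:23999 0
EOF
```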

2. Checking the Logs

The ResourceManager log shows:

2016-09-10 12:02:05,953 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added node hadoop002:60027 cluster capacity: 
2016-09-10 12:02:05,990 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Node hadoop001:50623 reported UNHEALTHY with details: 1/1 local-dirs are bad: /data/disk1/data/yarn/local; 1/1 log-dirs are bad: /opt/beh/logs/yarn/userlog
2016-09-10 12:02:05,991 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: hadoop001:50623 Node Transitioned from RUNNING to UNHEALTHY
2016-09-10 12:02:05,993 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node hadoop001:50623 cluster capacity: 
2016-09-10 12:02:06,378 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved hadoop003 to /default-rack

The NodeManager log shows:

2016-09-10 12:02:02,869 INFO org.mortbay.log: jetty-6.1.26.cloudera.4
2016-09-10 12:02:02,905 INFO org.mortbay.log: Extract jar:file:/opt/beh/core/hadoop/share/hadoop/yarn/hadoop-yarn-common-2.6.0-cdh6.4.4.jar!/webapps/node to /tmp/Jetty_0_0_0_0_23999_node____tgfx6h/webapp
2016-09-10 12:02:03,242 INFO org.mortbay.log: Started HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:23999
2016-09-10 12:02:03,242 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app /node started at 23999
2016-09-10 12:02:03,735 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules
2016-09-10 12:02:03,775 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out 0 NM container statuses: []
2016-09-10 12:02:03,783 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registering with RM using containers :[]
2016-09-10 12:02:03,822 INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
2016-09-10 12:02:03,824 INFO org.apache.hadoop.io.retry.RetryInvocationHandler: Exception while invoking registerNodeManager of class ResourceTrackerPBClientImpl over rm2 after 1 fail over attempts. Trying to fail over after sleeping for 2138ms.
java.net.ConnectException: Call From hadoop002/192.168.30.22 to hadoop002:23125 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
        at org.apache.hadoop.ipc.Client.call(Client.java:1472)
        at org.apache.hadoop.ipc.Client.call(Client.java:1399)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
        at com.sun.proxy.$Proxy27.registerNodeManager(Unknown Source)
        at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:68)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at com.sun.proxy.$Proxy28.registerNodeManager(Unknown Source)
        at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:257)
        at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:191)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:264)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:463)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:509)
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
        at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
        at org.apache.hadoop.ipc.Client.call(Client.java:1438)
        ... 19 more
2016-09-10 12:02:05,965 INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to rm1
2016-09-10 12:02:05,996 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager: Rolling master-key for container-tokens, got key with id -1513537506
2016-09-10 12:02:05,998 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM: Rolling master-key for container-tokens, got key with id 701920721
2016-09-10 12:02:05,999 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as hadoop002:60027 with total resource of 
2016-09-10 12:02:05,999 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Notifying ContainerManager to unblock new container-requests

3. Analysis

By default, the NodeManager checks its local directories (local-dirs) every two minutes to determine which of them are usable. Note that once a directory is judged bad, it will not be marked usable again until the NodeManager is restarted, even if the disk recovers. When the number of good disks falls below a certain threshold, the whole node is marked unhealthy and no further tasks are scheduled on it.
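The check interval and thresholds behind this behaviour are configurable in yarn-site.xml. The fragment below is an illustrative sketch showing the relevant properties with their Hadoop 2.x default values, not the configuration of the cluster in this article:

```xml
<!-- NodeManager disk health checker settings (Hadoop 2.x defaults). -->
<property>
  <!-- How often local-dirs and log-dirs are checked: 120000 ms = 2 minutes. -->
  <name>yarn.nodemanager.disk-health-checker.interval-ms</name>
  <value>120000</value>
</property>
<property>
  <!-- A directory is marked bad once its disk exceeds this utilization. -->
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>90.0</value>
</property>
<property>
  <!-- Fraction of directories that must remain healthy; below this,
       the node is reported UNHEALTHY to the ResourceManager. -->
  <name>yarn.nodemanager.disk-health-checker.min-healthy-disks</name>
  <value>0.25</value>
</property>
```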

Checking the virtual machines' disks showed that hadoop001 and hadoop003 were nearly full. After deleting unneeded files to free up space, the UNHEALTHY nodes immediately returned to normal:

$ yarn node -list -all
Total Nodes:4
         Node-Id             Node-State  Node-Http-Address  Number-of-Running-Containers
 hadoop001:34354                RUNNING    hadoop001:23999                             0
 hadoop002:60027                RUNNING    hadoop002:23999                             0
 hadoop003:39700                RUNNING    hadoop003:23999                             0
 hadoop001:50623                   LOST    hadoop001:23999                             0
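To see how close each YARN directory is to the utilization limit before a node goes unhealthy, a quick check like the following helps. This is a sketch: the two paths are the local-dirs and log-dirs reported bad in the ResourceManager log above, so substitute your own yarn.nodemanager.local-dirs and log-dirs values.

```shell
# Print the utilization percentage of the filesystem holding a directory
# (column 5 of POSIX `df -P` output, with the trailing "%" stripped).
usage_pct() {
  df -P "$1" 2>/dev/null | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

# Paths taken from the "local-dirs are bad" message in the RM log;
# adjust to your own yarn.nodemanager.local-dirs / log-dirs settings.
for d in /data/disk1/data/yarn/local /opt/beh/logs/yarn/userlog; do
  echo "$d: $(usage_pct "$d")% used"
done
```

Anything at or above the disk health checker's utilization threshold (90% by default) is a candidate for cleanup.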

Why are there two hadoop001 entries? The configuration file was changed and the NodeManager on that host was restarted once, so two registrations appear: the old one in the LOST state and the new one RUNNING normally. This does not affect operation, and the stale entry disappears once YARN is restarted.

