
(6) Handling OSD down failures in a Ceph cluster


(1) Check the cluster status: two OSDs are reported down

[root@node140 /]# ceph -s
  cluster:
    id:     58a12719-a5ed-4f95-b312-6efd6e34e558
    health: HEALTH_ERR
            noout flag(s) set
            2 osds down
            1 scrub errors
            Possible data damage: 1 pg inconsistent
            Degraded data redundancy: 1633/10191 objects degraded (16.024%), 84 pgs degraded, 122 pgs undersized

  services:
    mon: 2 daemons, quorum node140,node142 (age 3d)
    mgr: admin(active, since 3d), standbys: node140
    osd: 18 osds: 16 up (since 3d), 18 in (since 5d)
         flags noout

  data:
    pools:   2 pools, 384 pgs
    objects: 3.40k objects, 9.8 GiB
    usage:   43 GiB used, 8.7 TiB / 8.7 TiB avail
    pgs:     1633/10191 objects degraded (16.024%)
             261 active+clean
             84  active+undersized+degraded
             38  active+undersized
             1   active+clean+inconsistent
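To see exactly which PGs are behind the HEALTH_ERR summary (the inconsistent PG and the degraded ones, plus the OSDs they map to), a quick extra check, not part of the original walkthrough, is:

# List the individual PGs and errors summarized in the health status above
ceph health detail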

(2) Check the OSD status

[root@node140 /]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
-1       9.80804 root default
-2       3.26935     host node140
 0   hdd 0.54489         osd.0        up  1.00000 1.00000
 1   hdd 0.54489         osd.1        up  1.00000 1.00000
 2   hdd 0.54489         osd.2        up  1.00000 1.00000
 3   hdd 0.54489         osd.3        up  1.00000 1.00000
 4   hdd 0.54489         osd.4        up  1.00000 1.00000
 5   hdd 0.54489         osd.5        up  1.00000 1.00000
-3       3.26935     host node141
12   hdd 0.54489         osd.12       up  1.00000 1.00000
13   hdd 0.54489         osd.13       up  1.00000 1.00000
14   hdd 0.54489         osd.14       up  1.00000 1.00000
15   hdd 0.54489         osd.15       up  1.00000 1.00000
16   hdd 0.54489         osd.16       up  1.00000 1.00000
17   hdd 0.54489         osd.17       up  1.00000 1.00000
-4       3.26935     host node142
 6   hdd 0.54489         osd.6        up  1.00000 1.00000
 7   hdd 0.54489         osd.7      down  1.00000 1.00000
 8   hdd 0.54489         osd.8      down  1.00000 1.00000
 9   hdd 0.54489         osd.9        up  1.00000 1.00000
10   hdd 0.54489         osd.10       up  1.00000 1.00000
11   hdd 0.54489         osd.11       up  1.00000 1.00000

(3) Check osd.7 and osd.8: both services have failed and cannot be brought back with a restart

[root@node140 /]# systemctl status ceph-osd@8.service
● ceph-osd@8.service - Ceph object storage daemon osd.8
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Fri 2019-08-30 17:36:50 CST; 1min 20s ago
  Process: 433642 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=1/FAILURE)
Aug 30 17:36:50 node140 systemd[1]: Failed to start Ceph object storage daemon osd.8.
Aug 30 17:36:50 node140 systemd[1]: Unit ceph-osd@8.service entered failed state.
Aug 30 17:36:50 node140 systemd[1]: ceph-osd@8.service failed.
Aug 30 17:36:50 node140 systemd[1]: ceph-osd@8.service holdoff time over, scheduling restart.
Aug 30 17:36:50 node140 systemd[1]: Stopped Ceph object storage daemon osd.8.
Aug 30 17:36:50 node140 systemd[1]: start request repeated too quickly for ceph-osd@8.service
Aug 30 17:36:50 node140 systemd[1]: Failed to start Ceph object storage daemon osd.8.
Aug 30 17:36:50 node140 systemd[1]: Unit ceph-osd@8.service entered failed state.
Aug 30 17:36:50 node140 systemd[1]: ceph-osd@8.service failed.
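The systemd log only tells us the daemon keeps failing to start. To confirm the underlying disks themselves are bad, the kernel log and SMART data can be checked on node142. This is a sketch outside the original procedure: the device names are illustrative (matching the /dev/sdc and /dev/sdd slots replaced later), and smartctl assumes the smartmontools package is installed.

# Look for I/O errors reported by the kernel against the suspect disks
dmesg -T | grep -iE 'sdc|sdd'
# Query the SMART overall health of each suspect disk (requires smartmontools)
smartctl -H /dev/sdc
smartctl -H /dev/sdd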

(4) OSD disk failure and state changes
When an OSD's disk fails, the OSD state changes to down. After the interval set by mon osd down out interval has elapsed, Ceph marks the OSD out and begins data migration and recovery. To reduce the impact, recovery can be disabled first and re-enabled after the disk has been replaced.
[root@node140 /]# cat /etc/ceph/ceph.conf
[global]
mon osd down out interval = 900
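Besides reading ceph.conf, on Nautilus and later releases the effective value can also be queried from the cluster's runtime configuration; a quick check might look like:

# Show the down->out interval (in seconds) currently applied to the monitors
ceph config get mon mon_osd_down_out_interval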

(5) Stop data rebalancing
[root@node140 /]# for i in noout nobackfill norecover noscrub nodeep-scrub;do ceph osd set $i;done
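To confirm the flags actually took effect before replacing the disks, the flag list can be checked afterwards, for example:

# The flags line should now include noout, nobackfill, norecover, noscrub and nodeep-scrub
ceph osd dump | grep flags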

(6) Locate the failed OSDs
[root@node140 /]# ceph osd tree | grep -i down
7 hdd 0.54489 osd.7 down 0 1.00000
8 hdd 0.54489 osd.8 down 0 1.00000
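If the mapping from the down OSD IDs to physical disks on node142 is not obvious, ceph-volume (used again later in this article) can show which block device backs each local OSD; a sketch run on node142:

# On node142: list local OSDs and the LVM/block devices backing them (osd.7 and osd.8 here)
ceph-volume lvm list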

(7) Unmount the failed OSDs' mount points on node142
[root@node142 ~]# umount /var/lib/ceph/osd/ceph-7
[root@node142 ~]# umount /var/lib/ceph/osd/ceph-8
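Although both daemons are already in the failed state, it is common practice to stop (and optionally disable) the systemd units before unmounting so that systemd does not keep trying to restart them; for example:

# Stop the failed OSD daemons so they are not restarted while the disks are being replaced
systemctl stop ceph-osd@7.service ceph-osd@8.service
systemctl disable ceph-osd@7.service ceph-osd@8.service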

(8) Remove the OSDs from the CRUSH map
[root@node142 ~]# ceph osd crush remove osd.7
removed item id 7 name 'osd.7' from crush map
[root@node142 ~]# ceph osd crush remove osd.8
removed item id 8 name 'osd.8' from crush map

(9) Delete the failed OSDs' authentication keys
[root@node142 ~]# ceph auth del osd.7
updated
[root@node142 ~]# ceph auth del osd.8
updated

(10) Remove the failed OSDs

[root@node142 ~]# ceph osd rm 7
removed osd.7
[root@node142 ~]# ceph osd rm 8
removed osd.8
[root@node142 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
-1       8.71826 root default
-2       3.26935     host node140
 0   hdd 0.54489         osd.0        up  1.00000 1.00000
 1   hdd 0.54489         osd.1        up  1.00000 1.00000
 2   hdd 0.54489         osd.2        up  1.00000 1.00000
 3   hdd 0.54489         osd.3        up  1.00000 1.00000
 4   hdd 0.54489         osd.4        up  1.00000 1.00000
 5   hdd 0.54489         osd.5        up  1.00000 1.00000
-3       3.26935     host node141
12   hdd 0.54489         osd.12       up  1.00000 1.00000
13   hdd 0.54489         osd.13       up  1.00000 1.00000
14   hdd 0.54489         osd.14       up  1.00000 1.00000
15   hdd 0.54489         osd.15       up  1.00000 1.00000
16   hdd 0.54489         osd.16       up  1.00000 1.00000
17   hdd 0.54489         osd.17       up  1.00000 1.00000
-4       2.17957     host node142
 6   hdd 0.54489         osd.6        up  1.00000 1.00000
 9   hdd 0.54489         osd.9        up  1.00000 1.00000
10   hdd 0.54489         osd.10       up  1.00000 1.00000
11   hdd 0.54489         osd.11       up  1.00000 1.00000
[root@node142 ~]#
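As a side note, steps (8) through (10) can be collapsed on Luminous and later: ceph osd purge removes the OSD from the CRUSH map, deletes its auth key, and removes the OSD entry in one command. A hedged equivalent of the three commands above:

# One-step removal equivalent to crush remove + auth del + osd rm (Luminous and later)
ceph osd purge 7 --yes-i-really-mean-it
ceph osd purge 8 --yes-i-really-mean-it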

(11) Replace the failed disks, confirm the new device names, then recreate the OSDs
[root@node142 ~]# ceph-volume lvm create --data /dev/sdd
[root@node142 ~]# ceph-volume lvm create --data /dev/sdc
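If a replacement disk carries leftover partitions or LVM metadata (for example when a previously used disk is installed), ceph-volume can wipe it before running the create commands above. This is an optional precaution, not part of the original procedure:

# Destroy any existing partition/LVM data on the new devices before creating OSDs on them
ceph-volume lvm zap /dev/sdc --destroy
ceph-volume lvm zap /dev/sdd --destroy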

(12) Verify the newly created OSDs
[root@node142 ~]# ceph-volume lvm list
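Besides listing the new logical volumes, it is worth confirming that the recreated OSDs registered with the cluster and came up; for example:

# The new OSDs should appear under host node142 with STATUS "up"
ceph osd tree
# Per-OSD utilisation and weight, to confirm they rejoined with the expected capacity
ceph osd df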
(13) Once the new OSDs have been added to the CRUSH map, clear the cluster flags set earlier
for i in noout nobackfill norecover noscrub nodeep-scrub;do ceph osd unset $i;done
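After the flags are cleared, backfill and recovery resume. Progress can be watched until the cluster returns to HEALTH_OK, for example:

# Check recovery/backfill progress; repeat (or stream with ceph -w) until health is HEALTH_OK
ceph -s
ceph -w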
