千家信息网

Elasticsearch cluster stuck in red: unassigned shards that cannot be allocated to any node

Published: 2025-01-23 Author: 千家信息网 editor

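1. Problem: the cluster went red, and restarting every node did not bring it back. Looking at the data directory on each node, the directories for the affected index contained no files.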

Reference blog post: http://www.wklken.me/posts/2015/05/23/elasticsearch-issues.html

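(Some people say that adding one more node makes the shards get allocated automatically. I added a node and that did not work. After the steps below, the cluster status went back to green.)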


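After some investigation, the cause turned out to be 4 shards that were not assigned to any node and that stayed unassigned even after a restart. The head plugin shows which shards cannot be allocated, and the following command also reports 4 unassigned shards: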

curl http://192.168.224.188:9200/_cluster/health\?pretty


{
  "cluster_name" : "gag-prod",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 233,
  "active_shards" : 466,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 4,   // this is the one
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 99.14893617021276
}
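
To check just that one field without reading the whole response, the health output can be filtered. This is a minimal sketch, assuming the pretty-printed response format shown above; the function name `unassigned_count` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: read a _cluster/health JSON response on stdin
# and print the value of "unassigned_shards".
unassigned_count() {
  sed -n 's/.*"unassigned_shards"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Usage against the cluster from this post (address taken from the example above):
#   curl -s 'http://192.168.224.188:9200/_cluster/health?pretty' | unassigned_count
```

The leading quote in the pattern keeps `delayed_unassigned_shards` from matching, so only the one counter is printed.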


curl http://192.168.224.188:9200/_cat/shards

From this output, find the index names whose shards are in the UNASSIGNED state. (They match the leftover shards shown in the head plugin.)


items22 4 p STARTED 2273 571.1kb 192.168.224.187 gag-prod-node-187
items22 4 r STARTED 2273 571.1kb 192.168.224.188 gag-prod-node-188
items22 2 p UNASSIGNED
items22 2 r UNASSIGNED
items22 1 p STARTED 2284 555.2kb 192.168.224.187 gag-prod-node-187
items22 1 r STARTED 2284 555.2kb 192.168.224.188 gag-prod-node-188
items22 3 p STARTED 2276 641.5kb 192.168.224.187 gag-prod-node-187
items22 3 r STARTED 2276 641.5kb 192.168.224.188 gag-prod-node-188
items22 0 p UNASSIGNED
items22 0 r UNASSIGNED
shop_entity7 4 p STARTED 53 29.6kb 192.168.224.187 gag-prod-node-187
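
On a cluster with hundreds of shards it is easier to filter this listing than to read it. A small sketch, assuming the column order shown above (index, shard number, p/r, state); `list_unassigned` is a name invented here.

```shell
#!/bin/bash
# Hypothetical helper: read `_cat/shards` output on stdin and print
# "index shard p|r" for every shard whose state column is UNASSIGNED.
list_unassigned() {
  awk '$4 == "UNASSIGNED" { print $1, $2, $3 }'
}

# Usage (address taken from the example above):
#   curl -s 'http://192.168.224.188:9200/_cat/shards' | list_unassigned
```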



Look up the unique ID of the master node:

curl 'http://192.168.224.188:9200/_nodes/process?pretty'


{
  "cluster_name" : "gag-prod",
  "nodes" : {
    "tdp1G9DbRseQm8xS9v8jng" : {   // unique ID of node 187
      "name" : "gag-prod-node-187",
      "transport_address" : "192.168.224.187:9300",
      "host" : "192.168.224.187",
      "ip" : "192.168.224.187",
      "version" : "2.3.2",
      "build" : "b9e4a6a",
      "http_address" : "192.168.224.187:9200",
      "attributes" : {
        "master" : "true"
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 10009,
        "mlockall" : false
      }
    },
    "a6tktPPYSCOGv4uw8uRclg" : {   // unique ID of node 186
      "name" : "gag-prod-node-186",
      "transport_address" : "192.168.224.186:9300",
      "host" : "192.168.224.186",
      "ip" : "192.168.224.186",
      "version" : "2.3.2",
      "build" : "b9e4a6a",
      "http_address" : "192.168.224.186:9200",
      "attributes" : {
        "master" : "false"
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 24049,
        "mlockall" : false
      }
    },
    "d5DvDdr6SLak8YCC099jRg" : {   // unique ID of node 188
      "name" : "gag-prod-node-188",
      "transport_address" : "192.168.224.188:9300",
      "host" : "192.168.224.188",
      "ip" : "192.168.224.188",
      "version" : "2.3.2",
      "build" : "b9e4a6a",
      "http_address" : "192.168.224.188:9200",
      "attributes" : {
        "master" : "true"
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 13058,
        "mlockall" : false
      }
    }
  }
}
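
The name-to-ID mapping can also be pulled out of this response mechanically. The sketch below is text matching, not a real JSON parser, and assumes the pretty-printed layout shown above, where each node ID is the key that opens the object containing its "name" field; `node_ids` is a made-up helper name.

```shell
#!/bin/bash
# Hypothetical helper: read a pretty-printed _nodes/process response on stdin
# and print "node-name node-id" pairs. It remembers the most recent key that
# opens an object; when a "name" field appears, that key is the node ID.
node_ids() {
  awk -F'"' '
    /"[^"]*"[ \t]*:[ \t]*[{]/ { key = $2 }
    /"name"[ \t]*:/           { print $4, key }
  '
}

# Usage (address taken from the example above):
#   curl -s 'http://192.168.224.188:9200/_nodes/process?pretty' | node_ids
```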


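With the steps above we have found both the "problem shards" and the "node unique IDs", so we can now force the problem shards onto one of the nodes. Below, the problem shards are allocated to gag-prod-node-187.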


Write a script (if there are many unassigned shards, it can be written as a loop):


#!/bin/bash

# Force-allocate items22 shard 0 to gag-prod-node-187 (tdp1G9DbRseQm8xS9v8jng)
curl -XPOST '192.168.224.187:9200/_cluster/reroute' -d '{
  "commands" : [ {
    "allocate" : {
      "index" : "items22",
      "shard" : 0,
      "node" : "tdp1G9DbRseQm8xS9v8jng",
      "allow_primary" : true
    }
  } ]
}'

# Force-allocate items22 shard 2 to gag-prod-node-187 (tdp1G9DbRseQm8xS9v8jng)
curl -XPOST '192.168.224.187:9200/_cluster/reroute' -d '{
  "commands" : [ {
    "allocate" : {
      "index" : "items22",
      "shard" : 2,
      "node" : "tdp1G9DbRseQm8xS9v8jng",
      "allow_primary" : true
    }
  } ]
}'
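
For the many-shards case mentioned above, the two hand-written calls could be generated in a loop. This is only a sketch: `ES_URL` and `NODE_ID` default to the values used in this post, the function names are invented, and it targets the 2.x `allocate` command used above. Only unassigned primaries are rerouted, matching the script above. Since `allow_primary` can create an empty primary when no copy of the data exists, it is worth reviewing the shard list before calling `reroute_all`.

```shell
#!/bin/bash
# Sketch only: defaults are the address and node ID from the example above.
ES_URL="${ES_URL:-http://192.168.224.187:9200}"
NODE_ID="${NODE_ID:-tdp1G9DbRseQm8xS9v8jng}"

# Print the reroute request body for one shard.
# $1 = index name, $2 = shard number, $3 = node ID
reroute_body() {
  printf '{"commands":[{"allocate":{"index":"%s","shard":%s,"node":"%s","allow_primary":true}}]}\n' "$1" "$2" "$3"
}

# Read `_cat/shards` output on stdin and print "index shard"
# for every unassigned primary.
unassigned_primaries() {
  awk '$3 == "p" && $4 == "UNASSIGNED" { print $1, $2 }'
}

# Force-allocate every unassigned primary to NODE_ID.
# Nothing is sent until this function is called explicitly.
reroute_all() {
  curl -s "$ES_URL/_cat/shards" | unassigned_primaries |
  while read -r index shard; do
    curl -s -XPOST "$ES_URL/_cluster/reroute" \
      -d "$(reroute_body "$index" "$shard" "$NODE_ID")"
    echo
  done
}
```

Running `curl -s "$ES_URL/_cat/shards" | unassigned_primaries` first shows exactly which shards `reroute_all` would touch.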


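After running this script, check the cluster status again: it has recovered. Once these shards had been replicated to another node automatically, I stopped the gag-prod-node-187 node, and the shards were reassigned to nodes automatically.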
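
The status check after the reroute can be scripted as well. A minimal sketch, assuming the health response format shown earlier; `cluster_status` is a hypothetical helper, and the 60s timeout is an arbitrary choice.

```shell
#!/bin/bash
# Hypothetical helper: read a _cluster/health JSON response on stdin
# and print the value of "status" (green, yellow or red).
cluster_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# Usage (address taken from the example above). wait_for_status makes the
# request block until the cluster reaches green or the timeout expires:
#   curl -s 'http://192.168.224.188:9200/_cluster/health?wait_for_status=green&timeout=60s&pretty' | cluster_status
```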

