redis 集群 CLUSTERDOWN Hash slot not served 报错解决思路

本想深夜直接洗洗睡了,但是越想越来气,redis 不熟悉用的是真的头大,无奈写篇博客记录下自己的排错思路(其实就是各种百度啦~~)

背景:  因机房故障,导致redis-cluster 集群中 有个节点 down 了 ,再从新拉起节点加入集群后,就没在意

过来小许,服务器磁盘告警来了,我看了下 应用服务器上的tomcat 的日志,不停的刷,日志内容 报redis 错误, CLUSTERDOWN Hash slot not served

redis  集群  CLUSTERDOWN Hash slot not served 报错解决思路

吓得我 立马 去 redis 集群上去 检查 是否是我的刚刚加入的节点有问题

[root@admin ~]# redis-cli -c -h 192.168.207.251 -p 7001

192.168.207.251:7001>set testfxkj   6666

(error) CLUSTERDOWN Hash slot not served

192.168.207.251:7001>exit

提示hash 槽有问题 ;

退出redis集群 ,使用 redis-trib.rb check  对集群进行检测

备注: 我这里是redis 3.2 版本 ,使用 redis-trib.rb check 进行检测

            如果是 redis 3 版本以上 请使用 redis-cli –cluster check 命令

[root@admin ~]# redis-trib.rb check 192.168.207.251:7001

检测后 提示 如下图, [ERR] Not all 16384 slots are covered by nodes

redis  集群  CLUSTERDOWN Hash slot not served 报错解决思路

可以看到 16384号slots没有被分配

下一步,我们尝试使用 redis 官方推荐的 redis-trib.rb fix 命令 来修复集群(同样 redis 3版本以上的请使用redis-cli –cluster fix 进行修复  )

[root@admin ~]# redis-trib.rb fix 192.168.207.251:7001
>>> Performing Cluster Check (using node 192.168.207.251:7001)
M: 3d1808df8a82dd705a67e8fef26cb8c8482bd69a 192.168.207.251:7001
   slots:0-5459 (5460 slots) master
   1 additional replica(s)
S: 6fa8513ee71e0153e2e746fef2a3b1ce08348766 192.168.207.159:7003
   slots: (0 slots) slave
   replicates a10cf2bdaf1c445c34d6ae447a2f79bc5d4e9c32
M: 36dc79859c644cb03e6c1b0ec4a2d8bf56cfa39c 192.168.207.160:7001
   slots:10923-16383 (5461 slots) master
   1 additional replica(s)
S: d057d13bc64b9be9f68c0291700965b3751127c0 192.168.207.159:7005
   slots: (0 slots) slave
   replicates 36dc79859c644cb03e6c1b0ec4a2d8bf56cfa39c
M: a10cf2bdaf1c445c34d6ae447a2f79bc5d4e9c32 192.168.207.160:7002
   slots:5461-10922 (5462 slots) master
   1 additional replica(s)
S: a84cb46e811925ee48f0ff8c25e753be752fc45c 192.168.207.159:7004
   slots: (0 slots) slave
   replicates 3d1808df8a82dd705a67e8fef26cb8c8482bd69a
[ERR] Nodes don't agree about configuration!
>>> Check for open slots...
>>> Check slots coverage...
[ERR] Not all 16384 slots are covered by nodes.
>>> Fixing slots coverage...
List of not covered slots: 5460
Slot 5460 has keys in 0 nodes: 
The folowing uncovered slots have no keys across the cluster:
5460
Fix these slots by covering with a random node? (type 'yes' to accept): yes
>>> Covering slot 6829 with 192.168.207.251:7001
/usr/local/ruby/lib/ruby/gems/2.2.0/gems/redis-3.2.2/lib/redis/client.rb:114:in `call': ERR Slot 6829 is already busy (Redis::CommandError)
        from /usr/local/ruby/lib/ruby/gems/2.2.0/gems/redis-3.2.2/lib/redis.rb:2646:in `block in method_missing'
        from /usr/local/ruby/lib/ruby/gems/2.2.0/gems/redis-3.2.2/lib/redis.rb:57:in `block in synchronize'
        from /usr/local/ruby/lib/ruby/2.2.0/monitor.rb:211:in `mon_synchronize'
        from /usr/local/ruby/lib/ruby/gems/2.2.0/gems/redis-3.2.2/lib/redis.rb:57:in `synchronize'
        from /usr/local/ruby/lib/ruby/gems/2.2.0/gems/redis-3.2.2/lib/redis.rb:2645:in `method_missing'
        from /usr/local/bin/redis-trib.rb:462:in `block in fix_slots_coverage'
        from /usr/local/bin/redis-trib.rb:459:in `each'
        from /usr/local/bin/redis-trib.rb:459:in `fix_slots_coverage'
        from /usr/local/bin/redis-trib.rb:398:in `check_slots_coverage'
        from /usr/local/bin/redis-trib.rb:361:in `check_cluster'
        from /usr/local/bin/redis-trib.rb:1139:in `fix_cluster_cmd'
        from /usr/local/bin/redis-trib.rb:1695:in `<main>'

fix 修复失败,提示 slot 6829 正在使用

登录集群,CLUSTER DELSLOTS 6829  (使一个特定的Redis Cluster节点去忘记一个主节点正在负责的哈希槽)

[root@admin ~]# redis-cli -c -h 192.168.207.251 -p 7001

192.168.207.251:7001> CLUSTER DELSLOTS 6829

OK

再次 使用 redis-trib.rb fix  去修复集群

 

[root@admin ~]# redis-trib.rb fix 192.168.207.251:7001
>>> Performing Cluster Check (using node 192.168.207.251:7001)
M: 3d1808df8a82dd705a67e8fef26cb8c8482bd69a 192.168.207.251:7001
   slots:0-6829 (6829 slots) master
   1 additional replica(s)
S: 6fa8513ee71e0153e2e746fef2a3b1ce08348766 192.168.207.159:7003
   slots: (0 slots) slave
   replicates a10cf2bdaf1c445c34d6ae447a2f79bc5d4e9c32
M: 36dc79859c644cb03e6c1b0ec4a2d8bf56cfa39c 192.168.207.160:7001
   slots:10923-16383 (5461 slots) master
   1 additional replica(s)
S: d057d13bc64b9be9f68c0291700965b3751127c0 192.168.207.159:7005
   slots: (0 slots) slave
   replicates 36dc79859c644cb03e6c1b0ec4a2d8bf56cfa39c
M: a10cf2bdaf1c445c34d6ae447a2f79bc5d4e9c32 192.168.207.160:7002
   slots:5461-10922 (5462 slots) master
   1 additional replica(s)
S: a84cb46e811925ee48f0ff8c25e753be752fc45c 192.168.207.159:7004
   slots: (0 slots) slave
   replicates 3d1808df8a82dd705a67e8fef26cb8c8482bd69a
[ERR] Nodes don't agree about configuration!
>>> Check for open slots...
>>> Check slots coverage...
[ERR] Not all 16384 slots are covered by nodes.
>>> Fixing slots coverage...
List of not covered slots: 6829
Slot 5460 has keys in 0 nodes: 
The folowing uncovered slots have no keys across the cluster:
5460
Fix these slots by covering with a random node? (type 'yes' to accept): yes
>>> Covering slot 6829 with 192.168.207.251:7001

再次check

[root@admin ~]# redis-trib.rb fix 192.168.207.251:7001
>>> Performing Cluster Check (using node 192.168.207.251:7001)
M: 3d1808df8a82dd705a67e8fef26cb8c8482bd69a 192.168.207.251:7001
   slots:0-6829  (6829 slots) master
   1 additional replica(s)
S: 6fa8513ee71e0153e2e746fef2a3b1ce08348766 192.168.207.159:7003
   slots: (0 slots) slave
   replicates a10cf2bdaf1c445c34d6ae447a2f79bc5d4e9c32
M: 36dc79859c644cb03e6c1b0ec4a2d8bf56cfa39c 192.168.207.160:7001
   slots:10923-16383 (5461 slots) master
   1 additional replica(s)
S: d057d13bc64b9be9f68c0291700965b3751127c0 192.168.207.159:7005
   slots: (0 slots) slave
   replicates 36dc79859c644cb03e6c1b0ec4a2d8bf56cfa39c
M: a10cf2bdaf1c445c34d6ae447a2f79bc5d4e9c32 192.168.207.160:7002
   slots:5460-10922 (5463 slots) master
   1 additional replica(s)
S: a84cb46e811925ee48f0ff8c25e753be752fc45c 192.168.207.159:7004
   slots: (0 slots) slave
   replicates 3d1808df8a82dd705a67e8fef26cb8c8482bd69a
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.

发现 16384 个哈希槽全部分配好了

 

这时我们再登录集群测试写个KEY

[root@admin ~]# redis-cli -c -h 192.168.207.251 -p 7001

192.168.207.251:7001>set testfxkj   6666

ok

发现 成功写入了KEY

同时也看了下  tomcat 得到日志 也不再刷 CLUSTERDOWN Hash slot not served

至此,本篇结束 。

2020/12/2  凌晨   南京

 

 

本文版权归 飞翔沫沫情 作者所有,欢迎转载,但未经作者同意必须保留此段声明,且在文章页面明显位置给出 原文链接 如有问题, 可发送邮件咨询,转贴请注明出处:https://www.fxkjnj.com/2375/

发表评论

登录后才能评论

评论列表(1条)

  • 疯子摆摊卖回忆
    疯子摆摊卖回忆 2020年12月2日 下午7:02

    哎 每次遇到redis 集群故障 都很难解决, 用时一时爽 哈哈
    楼主 这个经验分享的不错 ,收藏了