OSDs on one node of a Ceph cluster persistently miss heartbeats

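The software stack is CentOS 7.4, Ceph 12.2.2, and OpenStack Pike.
There are 3 management nodes and 15 hyperconverged nodes, with 30 OSDs per node.
After deployment I found a problem: even light writes trigger slow requests. I eventually traced it to one node on which all OSDs frequently fail heartbeat checks. Every node runs the same software versions.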
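I have checked the following:
1. The firewall is off and SELinux is disabled.
2. The iptables rules are the same as on the other nodes.
3. Pings to the management nodes and to the other nodes succeed.
4. The OSD ports are reachable in both directions between this node and the other nodes (tested with nc).
5. The network is 10 GbE and the cluster carries no load. In testing, none of the other nodes showed large numbers of OSDs missing heartbeats, only an occasional one.
6. The disks check out fine. All the hardware was purchased in the same batch.
7. The switch is also fine, with no unusual alarms.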
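
The nc test in item 4 can be repeated across many peers with a short script. The following is only a minimal sketch: the function names `port_reachable` and `check_peers` are made up for illustration, and the peer addresses in the usage section are simply the ones that appear in the heartbeat_check log lines of this post. A successful TCP connect shows that the port accepts connections at that moment; it does not prove that heartbeat replies arrive in time.

```python
import socket


def port_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_peers(peers, timeout=2.0):
    """Map each (host, port) pair to the result of a TCP connect attempt."""
    return {(host, port): port_reachable(host, port, timeout) for host, port in peers}


if __name__ == "__main__":
    # Peer addresses taken from the heartbeat_check log lines in this post.
    peers = [("10.1.0.6", 6857), ("10.1.0.4", 6837)]
    for (host, port), ok in check_peers(peers).items():
        print(f"{host}:{port} {'open' if ok else 'unreachable'}")
```

Running it from the suspect node and again from a healthy node gives a quick side-by-side comparison.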

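Could someone with experience advise on how to troubleshoot this? My current workaround is to remove this node from the cluster for the time being. With it removed, running fio in several virtual machines causes no slow requests, and no OSDs on the other nodes show heartbeat problems.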

Feb 22 05:02:45 compute13 ceph-osd[63389]: 2018-02-22 05:02:45.671581 7fbf2fcab700 -1 osd.360 28314 heartbeat_check: no reply from 10.1.0.6:6857 osd.167 since back 2018-02-22 05:02:08.096809 front 2018-02-22 05:02:44.742998 (cutoff 2018-02-22 05:02:25.671570)
Feb 22 05:03:39 compute13 ceph-osd[63389]: 2018-02-22 05:03:39.678850 7fbf2fcab700 -1 osd.360 28314 heartbeat_check: no reply from 10.1.0.4:6837 osd.97 since back 2018-02-22 05:03:18.989375 front 2018-02-22 05:03:35.815439 (cutoff 2018-02-22 05:03:19.678837)
Feb 22 05:03:40 compute13 ceph-osd[63389]: 2018-02-22 05:03:40.679085 7fbf2fcab700 -1 osd.360 28314 heartbeat_check: no reply from 10.1.0.4:6837 osd.97 since back 2018-02-22 05:03:18.989375 front 2018-02-22 05:03:40.519456 (cutoff 2018-02-22 05:03:20.679081)
Feb 22 05:03:41 compute13 ceph-osd[63389]: 2018-02-22 05:03:41.679249 7fbf2fcab700 -1 osd.360 28314 heartbeat_check: no reply from 10.1.0.4:6837 osd.97 since back 2018-02-22 05:03:18.989375 front 2018-02-22 05:03:40.519456 (cutoff 2018-02-22 05:03:21.679244)
Feb 22 05:03:42 compute13 ceph-osd[63389]: 2018-02-22 05:03:42.679423 7fbf2fcab700 -1 osd.360 28314 heartbeat_check: no reply from 10.1.0.4:6837 osd.97 since back 2018-02-22 05:03:18.989375 front 2018-02-22 05:03:40.519456 (cutoff 2018-02-22 05:03:22.679415)
Feb 22 05:03:43 compute13 ceph-osd[63389]: 2018-02-22 05:03:43.679602 7fbf2fcab700 -1 osd.360 28314 heartbeat_check: no reply from 10.1.0.4:6837 osd.97 since back 2018-02-22 05:03:18.989375 front 2018-02-22 05:03:40.519456 (cutoff 2018-02-22 05:03:23.679594)
Feb 22 05:03:44 compute13 ceph-osd[63389]: 2018-02-22 05:03:44.679774 7fbf2fcab700 -1 osd.360 28314 heartbeat_check: no reply from 10.1.0.4:6837 osd.97 since back 2018-02-22 05:03:18.989375 front 2018-02-22 05:03:44.624224 (cutoff 2018-02-22 05:03:24.679770)
Feb 22 05:03:45 compute13 ceph-osd[63389]: 2018-02-22 05:03:45.679947 7fbf2fcab700 -1 osd.360 28314 heartbeat_check: no reply from 10.1.0.4:6837 osd.97 since back 2018-02-22 05:03:18.989375 front 2018-02-22 05:03:44.624224 (cutoff 2018-02-22 05:03:25.679939)
Feb 22 05:03:46 compute13 ceph-osd[63389]: 2018-02-22 05:03:46.680135 7fbf2fcab700 -1 osd.360 28314 heartbeat_check: no reply from 10.1.0.4:6837 osd.97 since back 2018-02-22 05:03:18.989375 front 2018-02-22 05:03:44.624224 (cutoff 2018-02-22 05:03:26.680127)
Feb 22 05:03:47 compute13 ceph-osd[63389]: 2018-02-22 05:03:47.680299 7fbf2fcab700 -1 osd.360 28314 heartbeat_check: no reply from 10.1.0.4:6837 osd.97 since back 2018-02-22 05:03:18.989375 front 2018-02-22 05:03:44.624224 (cutoff 2018-02-22 05:03:27.680290)
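
To see which peers and which heartbeat channel are affected, log lines like the ones above can be summarized with a small script. This is a rough sketch: the regular expression is written against the exact line format shown in this post, and the names `parse_heartbeat_line` and `summarize` are hypothetical helpers. A reply is counted as stale when its timestamp is older than the cutoff printed on the same line. As far as I know, the "back" heartbeat normally travels over the cluster network and "front" over the public network, so a summary dominated by stale "back" replies may point at the cluster network of that node; treat that as a hint to verify, not a conclusion.

```python
import re
from collections import defaultdict
from datetime import datetime

# Pattern written against the heartbeat_check lines quoted in this post.
HEARTBEAT_RE = re.compile(
    r"osd\.(?P<reporter>\d+) \d+ heartbeat_check: no reply from "
    r"(?P<addr>[\d.]+):(?P<port>\d+) osd\.(?P<peer>\d+) "
    r"since back (?P<back>\S+ \S+) front (?P<front>\S+ \S+) "
    r"\(cutoff (?P<cutoff>\S+ \S+)\)"
)
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_heartbeat_line(line):
    """Return a dict for one heartbeat_check line, or None if it does not match."""
    match = HEARTBEAT_RE.search(line)
    if match is None:
        return None
    rec = match.groupdict()
    for key in ("back", "front", "cutoff"):
        rec[key] = datetime.strptime(rec[key], TS_FORMAT)
    # A channel is stale when its last reply is older than the cutoff.
    rec["back_stale"] = rec["back"] < rec["cutoff"]
    rec["front_stale"] = rec["front"] < rec["cutoff"]
    return rec


def summarize(lines):
    """Count stale back/front replies per (peer OSD id, peer address)."""
    summary = defaultdict(lambda: {"back": 0, "front": 0})
    for line in lines:
        rec = parse_heartbeat_line(line)
        if rec is None:
            continue
        entry = summary[(rec["peer"], rec["addr"])]
        entry["back"] += rec["back_stale"]
        entry["front"] += rec["front_stale"]
    return dict(summary)


if __name__ == "__main__":
    import sys

    # Usage: journalctl -u 'ceph-osd@*' | python this_script.py
    for (peer, addr), counts in summarize(sys.stdin).items():
        print(f"osd.{peer} at {addr}: back stale {counts['back']}, front stale {counts['front']}")
```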
2018-02-22 05:55

hl10502

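You can troubleshoot along these lines:
1. Check that the processes started normally;
2. Check clock synchronization;
3. Check the network;
4. Disable the firewall and SELinux.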
