k8s node flapping between Ready and NotReady, PLEG is not healthy - V2EX

k8s: one node keeps flapping between Ready and NotReady, PLEG is not healthy

Lunrry · 2023-09-15 10:16:09 +08:00 · 2388 clicks

Hi all, I have a question. In our company's k8s cluster, two nodes keep flapping between Ready and NotReady, roughly every 3 minutes. Version info:

```
Kernel Version:            3.10.0-1062.el7.x86_64
OS Image:                  CentOS Linux 7 (Core)
Operating System:          linux
Architecture:              amd64
Container Runtime Version: docker://18.6.1
Kubelet Version:           v1.14.1
Kube-Proxy Version:        v1.14.1
```

Node info:

```
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                3312m (28%)    370m (3%)
  memory             24302Mi (19%)  270Mi (0%)
  ephemeral-storage  0 (0%)         0 (0%)
Events:
  Type    Reason        Age                      From             Message
  ----    ------        ----                     ----             -------
  Normal  NodeNotReady  9m10s (x12084 over 45d)  kubelet, dev-11  Node dev-11 status is now: NodeNotReady
  Normal  NodeReady     4m9s (x12086 over 65d)   kubelet, dev-11  Node dev-11 status is now: NodeReady
```

Log output:

```
Sep 15 10:06:47 dev-11 kubelet[2016]: I0915 10:06:47.940194 2016 setters.go:521] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2023-09-15 10:06:47.940166803 +0800 CST m=+5667448.191374429 LastTransitionTime:2023-09-15 10:06:47.940166803 +0800 CST m=+5667448.191374429 Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m4.847729429s ago; threshold is 3m0s.}
Sep 15 10:06:50 dev-11 kubelet[2016]: I0915 10:06:50.280321 2016 kubelet.go:1823] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m7.187849858s ago; threshold is 3m0s.
.....
Sep 15 10:07:40 dev-11 kubelet[2016]: I0915 10:07:40.281597 2016 kubelet.go:1823] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m57.189127759s ago; threshold is 3m0s.
Sep 15 10:07:43 dev-11 kubelet[2016]: E0915 10:07:43.124845 2016 remote_runtime.go:321] ContainerStatus "1f718a7646f7c8126e784*********************930620d33ab9bb" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Sep 15 10:07:43 dev-11 kubelet[2016]: E0915 10:07:43.124906 2016 kuberuntime_manager.go:917] getPodContainerStatuses for pod "test-jdk11-1-0_test1(13*****-1fe1-11ee-a143-f4******bb5)" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Sep 15 10:08:42 dev-11 kubelet[2016]: E0915 10:08:42.995808 2016 kubelet_pods.go:1093] Failed killing the pod "test-jdk11-1-0": failed to "KillContainer" for "jdk11" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Sep 15 10:09:01 dev-11 kubelet[2016]: E0915 10:09:01.488058 2016 remote_runtime.go:402] Exec 1f718a7646f7c8126e784*********************930620d33ab9bb '/bin/sh' from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Sep 15 10:09:44 dev-11 kubelet[2016]: E0915 10:09:44.151795 2016 remote_runtime.go:321] ContainerStatus "1f718a7646f7c8126e784*********************930620d33ab9bb" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Sep 15 10:09:44 dev-11 kubelet[2016]: E0915 10:09:44.151843 2016 kuberuntime_manager.go:917] getPodContainerStatuses for pod "test-jdk11-1-0_test1(13*****-1fe1-11ee-a143-f4******bb5)" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Sep 15 10:10:05 dev-11 kubelet[2016]: E0915 10:10:05.742413 2016 remote_runtime.go:402] Exec 1f718a7646f7c8126e784*********************930620d33ab9bb '/bin/sh' from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
```
16 replies · last reply 2024-05-16 06:56:15 +08:00
yuan1028  #1  2023-09-15 11:16:30 +08:00
Try checking the runtime logs (docker or kata); this looks like a runtime problem.
Pythondr  #2  2023-09-15 11:24:51 +08:00
There's a zombie container, test-jdk11-1-0. Docker times out handling it, which hangs the docker CLI; force-restarting docker resolves it. The usual root cause here is buggy code inside the container that can't exit cleanly.
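A rough recovery sketch for that situation (hypothetical commands, not from the thread; the container name test-jdk11-1-0 comes from the logs above, and a systemd-managed docker is assumed):

```shell
#!/bin/sh
# Hypothetical cleanup helper for a container that hangs the docker CLI.
kill_zombie() {
    name="$1"
    # Try a force-remove first; if docker itself hangs, fall back to
    # restarting the runtime (and kubelet) on the node.
    docker rm -f "$name" > /dev/null 2>&1 || echo "docker hung on $name; restart dockerd"
}

# On the affected node one would then run, e.g.:
#   kill_zombie test-jdk11-1-0
#   systemctl restart docker && systemctl restart kubelet
```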
julyclyde  #3  2023-09-15 12:13:32 +08:00
@Pythondr Even if it's a pure software issue, no matter how badly the code is written, SIGKILL should still work, right?
I'm starting to suspect a failing disk.
Cola98  #4  2023-09-15 13:43:57 +08:00
Feels like a network problem.
Lunrry (OP)  #5  2023-09-15 13:46:54 +08:00
It does look like this container is the problem. Colleagues report that `kubectl exec` into it is sometimes slow, though it does still get in eventually. I'm just a k8s beginner, so I don't really understand the internals.
Lunrry (OP)  #6  2023-09-15 13:54:12 +08:00
@Cola98 #4 Both affected nodes show the same cause in the logs:
```
Sep 15 13:51:46 dev-11 kubelet[2016]: I0915 13:51:46.181789 2016 setters.go:521] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2023-09-15 13:51:46.181760277 +0800 CST m=+5680946.432967904 LastTransitionTime:2023-09-15 13:51:46.181760277 +0800 CST m=+5680946.432967904 Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m5.432099472s ago; threshold is 3m0s.}
Sep 15 13:51:47 dev-11 kubelet[2016]: I0915 13:51:47.280267 2016 kubelet.go:1823] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m6.530576287s ago; threshold is 3m0s.
Sep 15 13:51:52 dev-11 kubelet[2016]: I0915 13:51:52.280410 2016 kubelet.go:1823] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m11.530719718s ago; threshold is 3m0s.
```
We use a CNI plugin for networking, and the dozens of other nodes are fine.
Cola98  #7  2023-09-15 17:27:42 +08:00
@Lunrry Sorry, I didn't read carefully enough earlier.
Lunrry (OP)  #8  2023-09-15 17:54:29 +08:00
We've narrowed it down: it's 90% likely the jdk container.
On each iteration, the PLEG health check calls `docker ps` to detect container state changes and `docker inspect` to fetch details on those containers. After each iteration completes it updates a timestamp; if the timestamp isn't updated for a period (i.e. 3 minutes), the health check fails.
Running `docker inspect` in a loop from a script, both problem machines, dev-11 and dev-13, got stuck on the jdk container; once the 3 minutes elapsed, the check timed out and the node was marked NotReady.
All that's left now is to debug the container orchestration script.
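The diagnostic loop described above can be sketched like this (a hypothetical script, not the OP's actual one; the 180 s cap mirrors kubelet's 3 m PLEG threshold):

```shell
#!/bin/sh
# Time `docker inspect` for every running container; a container whose
# inspect approaches the 3m PLEG threshold is the likely culprit.
inspect_one() {
    cid="$1"
    start=$(date +%s)
    # Cap each inspect at the PLEG threshold so the loop itself cannot hang.
    timeout 180 docker inspect "$cid" > /dev/null 2>&1
    rc=$?
    echo "$cid took $(( $(date +%s) - start ))s (rc=$rc)"
}

for cid in $(docker ps -q 2>/dev/null); do
    inspect_one "$cid"
done
```

A container that hits the cap shows rc=124 (GNU `timeout`'s exit code on timeout), which flags the hang explicitly.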
hancai  #9  2023-09-20 11:13:47 +08:00
(empty reply — see #10)
hancai  #10  2023-09-20 11:16:51 +08:00
@hancai Replied to the wrong thing above by accident. PLEG problems are most likely a kernel bug; I've run into this many times.
Lunrry (OP)  #11  2023-09-20 14:05:55 +08:00
@hancai #10 But the other machines run the same OS and the same kernel version. Could it still cause this?
hancai  #12  2023-09-20 16:18:22 +08:00
Even with the same kernel and the same cluster, only some nodes are affected. Most likely the kernel log keeps printing `unregister_netdevice: waiting for XXX to become free. Usage count = 1`. Also check whether some pods in the cluster are stuck in Terminating.
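A quick check for that log signature could look like this (a hypothetical helper; on the node you would feed it the kernel log):

```shell
#!/bin/sh
# Return 0 if the netdevice refcount-leak message appears on stdin.
has_netdev_leak() {
    grep -q 'unregister_netdevice: waiting for .* to become free'
}

# On a suspect node:
#   dmesg | has_netdev_leak && echo "kernel refcount leak detected"
```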
hancai  #13  2023-09-20 16:23:38 +08:00
Search for these two blog posts: "内核 bug 修复方案:网络设备引用计数泄" (kernel bug fix: network device reference count leak) and "记一次 k8s 集群 pod 一直 terminating 问题的排查" (troubleshooting pods stuck in Terminating in a k8s cluster). If the symptoms match, it's a kernel problem; I hit it in two k8s clusters this year. I've also seen `docker inspect` hang, but in every case the final fix was upgrading the kernel. A sandbox container that isn't destroyed cleanly can also trigger this.
Lunrry (OP)  #14  2023-09-21 08:58:58 +08:00
@hancai #13 Got it, I'll go take a look. Thanks for the pointers!
yiyu1211  #15  2023-11-17 09:49:51 +08:00
So what turned out to be wrong with the jdk container in the end?
DavidWei  #16  2024-05-16 06:56:15 +08:00 via Android
Upgrading the kernel fixes it.