Debugging OOM and pod restart issues in Kubernetes

Recently, customers reported that a pod kept restarting after it was migrated from one node to another, and that the Java process inside the pod exited abnormally.
After some troubleshooting, we found the root cause: an OOM triggered by the LimitRange of the namespace. The kernel killed the newly created process as soon as the memory requested by the JVM exceeded the default limit. In this article I explain the troubleshooting steps one by one; the same problem determination (PD) method should apply to most application OOM and pod restart issues in Kubernetes.

The Java/Jetty application fails to start

kubectl logs console-54dc5566b4-nq6gs -n test

The Jetty process failed to start and wrote no log:

Starting Jetty:
start-stop-daemon -S -p/var/run/ -croot -d/var/lib/jetty -b -m -a /usr/bin/java -- -Xms512m -Xmx1g -Djetty.logs=/usr/local/jetty/logs,Agility,NotAliCloud,NotPublicCloud -Djetty.home=/usr/local/jetty -Djetty.base=/var/lib/jetty -jar /usr/local/jetty/start.jar jetty.state=/var/lib/jetty/jetty.state jetty-started.xml start-log-file=/usr/local/jetty/logs/start.log
FAILED Thu Mar 14 09:43:55 UTC 2019
tail: cannot open '/var/lib/jetty/logs/*.log' for reading: No such file or directory
tail: no files remaining

The console pod keeps being recreated on the k8s cluster

kubectl get events --all-namespaces | grep console

test      1m        1m        1         console-54dc5566b4-sx2r6.158bc8b1f2a076ce   Pod       spec.containers{console}   Normal    Killing   kubelet, k8s003   Killing container with id docker://console:Container failed liveness probe.. Container will be killed and recreated.
test      1m        6m        2         console-54dc5566b4-hx6wb.158bc86c4379c4e7   Pod       spec.containers{console}   Normal    Started   kubelet, k8s001   Started container
test      1m        6m        2         console-54dc5566b4-hx6wb.158bc86c355ab395   Pod       spec.containers{console}   Normal    Created   kubelet, k8s001   Created container
test      1m        6m        2         console-54dc5566b4-hx6wb.158bc86c2fe32c76   Pod       spec.containers{console}   Normal    Pulled    kubelet, k8s001   Container image "" already present on machine
test      1m        1m        1         console-54dc5566b4-hx6wb.158bc8b87083e752   Pod       spec.containers{console}   Normal    Killing   kubelet, k8s001   Killing container with id docker://console:Container failed liveness probe.. Container will be killed and recreated.

Identify the OOM from the pod state

kubectl get pod console-54dc5566b4-hx6wb -n test -o yaml | grep reason -C5

        containerID: docker://90e5c9e618f3e745ebf510b8f215da3a165e3d03be58e0369e27c1773e75ef70
        exitCode: 137
        finishedAt: 2019-03-14T09:29:51Z
        reason: OOMKilled
        startedAt: 2019-03-14T09:24:51Z
    name: console
    ready: true
    restartCount: 3

kubectl get pod console-54dc5566b4-hx6wb -n test -o jsonpath='{.status.containerStatuses[].lastState}'

map[terminated:map[exitCode:137 reason:OOMKilled startedAt:2019-03-14T09:24:51Z finishedAt:2019-03-14T09:29:51Z containerID:docker://90e5c9e618f3e745ebf510b8f215da3a165e3d03be58e0369e27c1773e75ef70]]
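
The exit code in this output is a useful hint on its own. By convention, an exit code above 128 means the process was terminated by signal number (code - 128), so 137 corresponds to signal 9 (SIGKILL), the signal the kernel OOM killer sends. A minimal sketch of that arithmetic follows; the function name signal_from_exit_code is an illustrative assumption, not something kubectl provides.

```shell
# Map a container exit code to the signal that terminated the process.
# Exit codes above 128 conventionally mean "killed by signal (code - 128)".
signal_from_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo $((code - 128))
  else
    # 0 here means "not terminated by a signal".
    echo 0
  fi
}

signal_from_exit_code 137   # prints 9, the number of SIGKILL
```

Exit code 137 alone only shows that the process received SIGKILL; it is the reason: OOMKilled field that ties the kill to memory.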

Confirm the OOM in the system log to validate the assumption

The following errors indicate a Java OOM caused by the cgroup memory setting (note the mem_cgroup_oom_synchronize frame in the trace):

# grep oom /var/log/messages

/var/log/messages:2019-03-14T09:15:17.541049+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.949064] java invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=0, order=0, oom_score_adj=968
/var/log/messages:2019-03-14T09:15:17.541117+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.949153]  [<ffffffff81191de4>] oom_kill_process+0x214/0x3f0
/var/log/messages:2019-03-14T09:15:17.541119+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.949171]  [<ffffffff811f9481>] mem_cgroup_oom_synchronize+0x2f1/0x310
/var/log/messages:2019-03-14T09:15:17.541147+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.950571] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name

# grep oom /var/log/warn

2019-03-14T09:15:17.541049+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.949064] java invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=0, order=0, oom_score_adj=968
2019-03-14T09:15:17.541117+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.949153]  [<ffffffff81191de4>] oom_kill_process+0x214/0x3f0
2019-03-14T09:15:17.541119+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.949171]  [<ffffffff811f9481>] mem_cgroup_oom_synchronize+0x2f1/0x310
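
The check above can be sketched as a small filter. This is a heuristic under stated assumptions, not a definitive rule: it assumes the kernel prints a stack trace like the one shown here, and the function name classify_oom and its three labels are hypothetical. It treats a mem_cgroup_oom frame as a sign that a cgroup memory limit was hit rather than the whole node running out of memory.

```shell
# Classify kernel OOM log text read from stdin.
# A mem_cgroup_oom* frame in the trace suggests a cgroup memory limit was hit;
# without it, the OOM was more likely node-wide (heuristic only).
classify_oom() {
  log=$(cat)
  if printf '%s\n' "$log" | grep -q 'invoked oom-killer'; then
    if printf '%s\n' "$log" | grep -q 'mem_cgroup_oom'; then
      echo cgroup-limit
    else
      echo node-level
    fi
  else
    echo no-oom
  fi
}

# Example with two lines shortened from the log above.
classify_oom <<'EOF'
kernel: java invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=0, order=0, oom_score_adj=968
kernel:  [<ffffffff811f9481>] mem_cgroup_oom_synchronize+0x2f1/0x310
EOF
```

On a node, the same function could be fed from the log file, for example with grep -A20 'invoked oom-killer' /var/log/messages | classify_oom.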

Root cause:

kubectl get pod console-54dc5566b4-hx6wb -n test -o yaml | grep limits -A4

  memory: 512Mi
  memory: 256Mi

In this case, the console pod does not set its own memory limits, so it inherits the default limits of the namespace test:

kubectl describe pod console-54dc5566b4-hx6wb -n test | grep limit

Annotations: plugin set: memory request for container console; memory limit for container console

kubectl get limitrange -n test

NAME              CREATED AT
mem-limit-range   2019-03-14T09:04:10Z

kubectl describe ns test

Name:         test
Labels:       <none>
Annotations:  <none>
Status:       Active

No resource quota.

Resource Limits
 Type       Resource  Min  Max  Default Request  Default Limit  Max Limit/Request Ratio
 ----       --------  ---  ---  ---------------  -------------  -----------------------
 Container  memory    -    -    256Mi            512Mi          -
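
To make the mismatch concrete, the JVM flags from the Jetty start command (-Xms512m -Xmx1g) can be compared with the Default Limit of 512Mi shown above. The helper below is a minimal sketch; the name to_mib is an assumption, and it only handles the size suffixes that appear in this article.

```shell
# Convert a JVM-style size (e.g. 1g, 512m) or a Kubernetes-style
# binary quantity (e.g. 512Mi, 1Gi) to MiB. Only these suffixes are handled.
to_mib() {
  case $1 in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *[gG]) echo $(( ${1%[gG]} * 1024 )) ;;
    *[mM]) echo "${1%[mM]}" ;;
    *) echo "unsupported size: $1" >&2; return 1 ;;
  esac
}

heap_max=$(to_mib 1g)       # -Xmx1g from the Jetty command line
limit=$(to_mib 512Mi)       # Default Limit from the LimitRange

if [ "$heap_max" -gt "$limit" ]; then
  echo "max heap ${heap_max}Mi exceeds container limit ${limit}Mi"
fi
```

Even the initial heap (-Xms512m) is as large as the whole container limit, and the JVM also needs memory outside the heap, which is consistent with the process being killed shortly after it starts.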

Fixing the OOM issue

After the default limits were removed and the pod was recreated, the application became healthy.

kubectl delete limitrange mem-limit-range -n test
kubectl delete pod console-54dc5566b4-hx6wb -n test
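
If the LimitRange has to stay in place for other workloads, a possible alternative is to give the application explicit memory settings that cover the JVM heap, since a container with its own limits does not take the namespace defaults. The sketch below only prints the command for review. The deployment name console is guessed from the pod name prefix, and the 512Mi of headroom for non-heap memory is an arbitrary assumption that would need tuning for a real workload.

```shell
# Hypothetical values: the deployment name and the sizes are assumptions.
NAMESPACE=test
DEPLOYMENT=console        # assumed from the pod name prefix
HEAP_MAX_MI=1024          # -Xmx1g from the Jetty command line
OVERHEAD_MI=512           # assumed headroom for metaspace, threads, native memory
LIMIT_MI=$((HEAP_MAX_MI + OVERHEAD_MI))

# Print the command instead of running it, so it can be reviewed first.
echo kubectl set resources deployment "$DEPLOYMENT" -n "$NAMESPACE" \
  --requests="memory=${HEAP_MAX_MI}Mi" --limits="memory=${LIMIT_MI}Mi"
```

Running the printed command would roll out new pods with a 1536Mi memory limit, without deleting the LimitRange.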

