Debugging an OOM / pod restarting issue in Kubernetes

Recently, a customer reported an issue where a pod kept restarting after being migrated from node to node, and the Java process inside the pod exited abnormally.
After some troubleshooting, we found the root cause: an OOM kill triggered by the LimitRange of the namespace. The kernel killed the newly created process as soon as the JVM's memory usage exceeded the default limit. In this article, I will walk through the troubleshooting method step by step; the same problem-determination (PD) method applies to most application OOM / pod restarting issues in Kubernetes.

Failed to start the Java/Jetty application

kubectl logs console-54dc5566b4-nq6gs -n test

The Jetty process failed to start, without producing any application log:

Starting Jetty: -Djava.io.tmpdir=/tmp/jetty
start-stop-daemon -S -p/var/run/jetty.pid -croot -d/var/lib/jetty -b -m -a /usr/bin/java -- -Djava.io.tmpdir=/tmp/jetty -Xms512m -Xmx1g -Djetty.logs=/usr/local/jetty/logs -Dspring.profiles.active=StandaloneLogin,Agility,NotAliCloud,NotPublicCloud -Djetty.home=/usr/local/jetty -Djetty.base=/var/lib/jetty -Djava.io.tmpdir=/tmp/jetty -jar /usr/local/jetty/start.jar jetty.state=/var/lib/jetty/jetty.state jetty-started.xml start-log-file=/usr/local/jetty/logs/start.log
jetty.state=/var/lib/jetty/jetty.state
FAILED Thu Mar 14 09:43:55 UTC 2019
tail: cannot open '/var/lib/jetty/logs/*.log' for reading: No such file or directory
tail: no files remaining
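
Because the failing container is killed and restarted, its output is often only visible on the previous instance. As an additional check not shown in the original session, the logs of the previous container can be pulled with the standard --previous flag:

kubectl logs console-54dc5566b4-nq6gs -n test --previous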

The console pod keeps getting killed and re-created on the Kubernetes cluster

kubectl get events --all-namespaces | grep console

test      1m        1m        1         console-54dc5566b4-sx2r6.158bc8b1f2a076ce   Pod       spec.containers{console}   Normal    Killing   kubelet, k8s003   Killing container with id docker://console:Container failed liveness probe.. Container will be killed and recreated.
test      1m        6m        2         console-54dc5566b4-hx6wb.158bc86c4379c4e7   Pod       spec.containers{console}   Normal    Started   kubelet, k8s001   Started container
test      1m        6m        2         console-54dc5566b4-hx6wb.158bc86c355ab395   Pod       spec.containers{console}   Normal    Created   kubelet, k8s001   Created container
test      1m        6m        2         console-54dc5566b4-hx6wb.158bc86c2fe32c76   Pod       spec.containers{console}   Normal    Pulled    kubelet, k8s001   Container image "registry.cn-hangzhou.aliyuncs.com/eric-dev/console:0.9-62f837e" already present on machine
test      1m        1m        1         console-54dc5566b4-hx6wb.158bc8b87083e752   Pod       spec.containers{console}   Normal    Killing   kubelet, k8s001   Killing container with id docker://console:Container failed liveness probe.. Container will be killed and recreated.
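
Grepping the cluster-wide event list works, but events can also be filtered server-side. As an alternative to the command above (not part of the original session), a field selector narrows the output to a single pod:

kubectl get events -n test --field-selector involvedObject.name=console-54dc5566b4-hx6wb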

Determine an OOM from the pod state

kubectl get pod console-54dc5566b4-hx6wb -n test -o yaml | grep reason -C5

lastState:
      terminated:
        containerID: docker://90e5c9e618f3e745ebf510b8f215da3a165e3d03be58e0369e27c1773e75ef70
        exitCode: 137
        finishedAt: 2019-03-14T09:29:51Z
        reason: OOMKilled
        startedAt: 2019-03-14T09:24:51Z
    name: console
    ready: true
    restartCount: 3
    state:

kubectl get pod console-54dc5566b4-hx6wb -n test -o jsonpath='{.status.containerStatuses[].lastState}'

map[terminated:map[exitCode:137 reason:OOMKilled startedAt:2019-03-14T09:24:51Z finishedAt:2019-03-14T09:29:51Z containerID:docker://90e5c9e618f3e745ebf510b8f215da3a165e3d03be58e0369e27c1773e75ef70]]
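
Exit code 137 is 128 + 9, i.e. the process was terminated by SIGKILL, which together with reason: OOMKilled confirms the kernel OOM killer was involved. To scan a whole cluster for containers whose last termination was an OOM kill, a one-liner like the following can be used (a suggestion, not part of the original session):

kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep OOMKilled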

Detect the OOM through the system log to validate the assumption

The following kernel messages indicate a Java OOM kill caused by the cgroup memory setting:

# grep oom /var/log/messages

/var/log/messages:2019-03-14T09:15:17.541049+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.949064] java invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=0, order=0, oom_score_adj=968
/var/log/messages:2019-03-14T09:15:17.541117+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.949153]  [<ffffffff81191de4>] oom_kill_process+0x214/0x3f0
/var/log/messages:2019-03-14T09:15:17.541119+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.949171]  [<ffffffff811f9481>] mem_cgroup_oom_synchronize+0x2f1/0x310
/var/log/messages:2019-03-14T09:15:17.541147+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.950571] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name

# grep oom /var/log/warn

2019-03-14T09:15:17.541049+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.949064] java invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=0, order=0, oom_score_adj=968
2019-03-14T09:15:17.541117+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.949153]  [<ffffffff81191de4>] oom_kill_process+0x214/0x3f0
2019-03-14T09:15:17.541119+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.949171]  [<ffffffff811f9481>] mem_cgroup_oom_synchronize+0x2f1/0x310
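
The mem_cgroup_oom_synchronize frame in the call trace shows this is a cgroup (memory limit) OOM rather than node-wide memory pressure. Since the container runtime here is Docker (the container IDs are docker://...), the effective memory limit of the killed container can be confirmed on the node; this extra check is an assumption about the node setup, not part of the original session:

docker inspect --format '{{.HostConfig.Memory}}' 90e5c9e618f3
# expected to print 536870912 bytes, i.e. the 512Mi cgroup limit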

Root cause:

kubectl get pod console-54dc5566b4-hx6wb -n test -o yaml | grep limits -A4

limits:
  memory: 512Mi
requests:
  memory: 256Mi

In this case, the console pod declared no resources of its own, so it inherited the default request (256Mi) and limit (512Mi) from the LimitRange of namespace test. The JVM, however, was started with -Xms512m -Xmx1g, so its memory footprint quickly exceeds the 512Mi cgroup limit and the kernel OOM killer terminates the process (exit code 137, OOMKilled).

kubectl describe pod console-54dc5566b4-hx6wb -n test | grep limit

Annotations:        kubernetes.io/limit-ranger=LimitRanger plugin set: memory request for container console; memory limit for container console
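
The annotation confirms that the LimitRanger admission plugin injected the request and limit because the container template declared none. Assuming the pod is managed by a Deployment named console (inferred from the pod name, not confirmed in the original session), the absence of explicit resources in the template can be verified with:

kubectl get deployment console -n test -o jsonpath='{.spec.template.spec.containers[0].resources}'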

kubectl get limitrange -n test

NAME              CREATED AT
mem-limit-range   2019-03-14T09:04:10Z

kubectl describe ns test

Name:         test
Labels:       <none>
Annotations:  <none>
Status:       Active

No resource quota.

Resource Limits
 Type       Resource  Min  Max  Default Request  Default Limit  Max Limit/Request Ratio
 ----       --------  ---  ---  ---------------  -------------  -----------------------
 Container  memory    -    -    256Mi            512Mi          -
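
For reference, a LimitRange producing the defaults shown above would look roughly like the manifest below. This is a reconstruction from the describe output, not the customer's actual object:

apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range
  namespace: test
spec:
  limits:
  - type: Container
    defaultRequest:
      memory: 256Mi   # becomes the request of containers that declare none
    default:
      memory: 512Mi   # becomes the limit of containers that declare none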

Action to fix the OOM issue

After deleting the problematic LimitRange and recreating the pod, the application became healthy:

kubectl delete limitrange mem-limit-range -n test
kubectl delete pod console-54dc5566b4-hx6wb -n test
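
Deleting the LimitRange removes the 512Mi cap altogether. An alternative fix is to keep the LimitRange but declare explicit resources on the console container so that the limit covers the JVM's -Xmx1g plus non-heap overhead. The values below are illustrative, not what was actually deployed:

resources:
  requests:
    memory: 1Gi
  limits:
    memory: 1536Mi   # headroom above -Xmx1g for metaspace, threads, etc.

After either change, the pod's lastState should no longer report OOMKilled and restartCount should stop increasing.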