sudo bug导致的zabbix断图问题-阿里云开发者社区

线上使用zabbix的host update来监测监控值是否完整（关于host update的实现请参考：

http://caiguangguang.blog.51cto.com/1652935/1345789）

一直发现有机器过一段时间update值就会莫名其妙变低，之前一直没有找到rc，只是简单通过重启agent来进行修复，最近同事细心地发现可能是和sudo的bug有关系。

回过头再来验证下整个的排查过程。

1.通过zabbix 数据库获取丢失数据的item，拿出缺失的(20分钟没有更新的)值的item列表

 
        select 
        b.key_,b.lastvalue,from_unixtime(b.lastclock) 
        from 
        hosts a, 
       
        items b 
        where 
        a.hostid=b.hostid 
        and 
        a.host=
        'xxxxxx' 
        and 
       
        b.lastclock < (unix_timestamp() - 1200) limit 10;

比如这里看agent.ping:

观察监控图，发现在18点20分之后数据丢失：

2.分析zabbix agent端的日志

发现在18点24粉左右出现下面的日志，没有看到正常的获取值和发送值的情况，只有大量的update_cpustats状态，同时发现有一行kill command 失败的日志:

 
        27589:20141021:182442.143 In zbx_popen() 
        command
        :
        'sudo hadoop_stats.sh nodemanager StopContainerAvgTime' 
       
        27589:20141021:182442.143 End of zbx_popen():5
       
        48430:20141021:182442.143 zbx_popen(): executing script
       
        27585:20141021:182442.284 In update_cpustats()
       
        27585:20141021:182442.285 End of update_cpustats()
       
        27585:20141021:182443.285 In update_cpustats()
       
        27585:20141021:182443.286 End of update_cpustats()
       
        27585:20141021:182444.286 In update_cpustats()
       
        27585:20141021:182444.287 End of update_cpustats()
       
        27585:20141021:182445.287 In update_cpustats()
       
        27585:20141021:182445.287 End of update_cpustats()
       
        27585:20141021:182446.288 In update_cpustats()
       
        27585:20141021:182446.288 End of update_cpustats()
       
        ..........
       
        27585:20141021:182508.305 In update_cpustats()
       
        27585:20141021:182508.305 End of update_cpustats()
       
        27585:20141021:182509.306 In update_cpustats()
       
        27585:20141021:182509.306 End of update_cpustats()
       
        27585:20141021:182510.306 In update_cpustats()
       
        27585:20141021:182510.307 End of update_cpustats()
       
        27585:20141021:182511.307 In update_cpustats()
       
        27585:20141021:182511.308 End of update_cpustats()
       
        27589:20141021:182512.154 failed to 
        kill 
        [
        sudo 
        hadoop_stats.sh nodemanager StopContainerAvgTime]: [1] Operation not permitted 
       
        27589:20141021:182512.155 In zbx_waitpid()
       
        27585:20141021:182512.308 In update_cpustats()
       
        27585:20141021:182512.309 End of update_cpustats()
       
        27585:20141021:182513.309 In update_cpustats()
       
        27585:20141021:182513.309 End of update_cpustats()

对比正常的日志：

 
        27589:20141021:180054.376 In zbx_popen() 
        command
        :
        'sudo hadoop_stats.sh nodemanager StopContainerAvgTime' 
       
        27589:20141021:180054.377 End of zbx_popen():5
       
        18798:20141021:180054.377 zbx_popen(): executing script
       
        27589:20141021:180054.384 In zbx_waitpid()
       
        27589:20141021:180054.384 zbx_waitpid() exited, status:1
       
        27589:20141021:180054.384 End of zbx_waitpid():18798
       
        27589:20141021:180054.384 Run remote 
        command 
        [
        sudo  
        hadoop_stats.sh nodemanager StopContainerAvgTime] Result [2] [-1]... 
       
        27589:20141021:180054.384 For key [hadoop_stats[nodemanager,StopContainerAvgTime]] received value [-1]
       
        27589:20141021:180054.384 In process_value() key:
        'gd6g203s80-hadoop-datanode.idc.vipshop.com:hadoop_stats[nodemanager,StopContainerAvgTime]' 
        value:
        '-1' 
       
        27589:20141021:180054.384 In send_buffer() host:
        '10.200.100.28' 
        port:10051 values:37
        /50 
       
        27589:20141021:180054.384 Will not send now. Now 1413885654 lastsent 1413885654 < 1
       
        27589:20141021:180054.385 End of send_buffer():SUCCEED
       
        27589:20141021:180054.385 buffer: new element 37
       
        27589:20141021:180054.385 End of process_value():SUCCEED

可以看到正常情况下脚本会有返回值，而出问题的时候，脚本是没有返回值的，并且由于是使用sudo 运行脚本，导致以普通用户启动的zabbix在超时时没有办法杀掉这个command(Operation not permitted 错误)

3.假设这里启动zabbix agent的普通用户为apps用户，我们看下这个脚本目前的状态

 
        ps 
        -ef|
        grep 
        hadoop_stats.sh 
       
        root     34494 31429  0 12:54 pts
        /0    
        00:00:00 
        grep 
        48430 
       
        root     48430 27589  0 Oct21 ?        00:00:00 
        sudo 
        hadoop_stats.sh nodemanager StopContainerAvgTime 
       
        root     48431 48430  0 Oct21 ?        00:00:00 [hadoop_stats.sh] <defunct>

可以看到，这里产生了一个僵尸进程（关于僵尸进程可以参考：http://en.wikipedia.org/wiki/Zombie_process）

僵尸进程是由于子进程运行完毕之后，发送SIGCHLD到父进程，而父进程没有正常处理这个信号导致。

 
        You have killed the process, but a dead process doesn't disappear from the process table 
       
        until 
        its parent process performs a task called 
        "reaping" 
        (essentially calling wait(3) 
       
        for 
        that process to 
        read 
        its 
        exit 
        status). Dead processes that haven't been reaped are 
       
        called 
        "zombie processes." 
       
        The parent process 
        id 
        you see 
        for 
        31756 is process 
        id 
        1, 
        which 
        always belongs to init.  
       
        That process should reap its zombie processes periodically, but 
        if 
        it can't, they will 
       
        remain zombies 
        in 
        the process table 
        until 
        you reboot.

正常的进程情况下，我们使用strace attach到父进程，然后杀掉子进程后可以看到如下信息：

 
        Process 3036 attached - interrupt to quit
       
        select
        (6, [5], [], NULL, NULL 
       
        )          = ? ERESTARTNOHAND (To be restarted)
       
        --- SIGCHLD (Child exited) @ 0 (0) ---
       
        rt_sigreturn(0x11)                      = -1 EINTR (Interrupted system call)
       
        wait4(3037, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], WNOHANG|WSTOPPED, NULL) = 3037
       
        exit_group(143)                         = ?
       
        Process 3036 detached

产生僵尸进程之后，可以通过杀掉父进程把僵尸进程变成孤儿进程（父进程为init进程）

但是这里因为是用sudo启动的脚本，导致启动用户都是root，apps用户就没有权限杀掉启动的命令，进而导致子进程一直是僵尸进程的状态存在

4.来看一下zabbix agent端启动的相关进程情况

 
  
    
      
     
        ps -ef|grep zabbix
       
 
        apps     
        27583     
        1  
        0 
        Sep09 ?        
        00
        :
        00
        :
        00 
        /apps/svr/zabbix/sbin/zabbix_agentd -c /apps/conf/zabbix_agentd.conf 
       
 
        apps     
        27585 
        27583  
        0 
        Sep09 ?        
        00
        :
        33
        :
        25 
        /apps/svr/zabbix/sbin/zabbix_agentd -c /apps/conf/zabbix_agentd.conf 
       
 
        apps     
        27586 
        27583  
        0 
        Sep09 ?        
        00
        :
        00
        :
        14 
        /apps/svr/zabbix/sbin/zabbix_agentd -c /apps/conf/zabbix_agentd.conf 
       
 
        apps     
        27587 
        27583  
        0 
        Sep09 ?        
        00
        :
        00
        :
        14 
        /apps/svr/zabbix/sbin/zabbix_agentd -c /apps/conf/zabbix_agentd.conf 
       
 
        apps     
        27588 
        27583  
        0 
        Sep09 ?        
        00
        :
        00
        :
        14 
        /apps/svr/zabbix/sbin/zabbix_agentd -c /apps/conf/zabbix_agentd.conf 
       
 
        apps     
        27589 
        27583  
        0 
        Sep09 ?        
        02
        :
        28
        :
        12 
        /apps/svr/zabbix/sbin/zabbix_agentd -c /apps/conf/zabbix_agentd.conf 
       
 
        root     
        34207 
        31429  
        0 
        12
        :
        54 
        pts/
        0    
        00
        :
        00
        :
        00 
        grep zabbix 
       
 
        root     
        48430 
        27589  
        0 
        Oct21 ?        
        00
        :
        00
        :
        00 
        sudo /apps/sh/zabbix_scripts/hadoop/hadoop_stats.sh nodemanager StopContainerAvgTime 
       
 
    

   
 

通过strace我们发现27589的进程一直在等待48430的进程

 
        strace  -p 
        27589 
       
        Process 
        27589 
        attached - interrupt to quit 
       
        wait4(
        48430
        , ^C <unfinished ...> 
       
        Process 
        27589 
        detached

而48430的进程即为僵尸进程的父进程，通过strace attach上去，可以看到在等待#5的fd

 
        strace  -p 
        48430 
       
        Process 
        48430 
        attached - interrupt to quit 
       
        select(
        6
        , [
        5
        ], [], NULL, NULL^C <unfinished ...> 
       
        Process 
        48430 
        detached

通过lsof可以看到#5的fd其实是一个socket

 
  
    
      
      
        lsof  -p 
        48430 
       

        COMMAND   PID USER   FD   TYPE             DEVICE  SIZE/OFF       NODE NAME
       
 
        sudo    
        48430 
        root  cwd    DIR                
        8
        ,
        2      
        4096          
        2 
        / 
       
 
        sudo    
        48430 
        root  rtd    DIR                
        8
        ,
        2      
        4096          
        2 
        / 
       
 
        sudo    
        48430 
        root  txt    REG                
        8
        ,
        2    
        212904    
        1578739 
        /usr/bin/sudo 
       
 
        sudo    
        48430 
        root  mem    REG                
        8
        ,
        2     
        65928    
        1441822 
        /lib64/libnss_files-
        2.12
        .so 
       
 
        sudo    
        48430 
        root  mem    REG                
        8
        ,
        2  
        99158704    
        1573509 
        /usr/lib/locale/locale-archive 
       
 
        sudo    
        48430 
        root  mem    REG                
        8
        ,
        2     
        91096    
        1441832 
        /lib64/libz.so.
        1.2
        .
        3 
       
 
        sudo    
        48430 
        root  mem    REG                
        8
        ,
        2    
        141576    
        1442145 
        /lib64/libpthread-
        2.12
        .so 
       
 
        sudo    
        48430 
        root  mem    REG                
        8
        ,
        2    
        386040    
        1442172 
        /lib64/libfreebl3.so 
       
 
        sudo    
        48430 
        root  mem    REG                
        8
        ,
        2    
        108728    
        1575924 
        /usr/lib64/libsasl2.so.
        2.0
        .
        23 
       
 
        sudo    
        48430 
        root  mem    REG                
        8
        ,
        2    
        243064    
        1441896 
        /lib64/libnspr4.so 
       
 
        sudo    
        48430 
        root  mem    REG                
        8
        ,
        2     
        21256    
        1442186 
        /lib64/libplc4.so 
       
 
        sudo    
        48430 
        root  mem    REG                
        8
        ,
        2     
        17096    
        1442187 
        /lib64/libplds4.so 
       
 
        sudo    
        48430 
        root  mem    REG                
        8
        ,
        2    
        128368    
        1577789 
        /usr/lib64/libnssutil3.so 
       
 
        sudo    
        48430 
        root  mem    REG                
        8
        ,
        2   
        1290648    
        1582418 
        /usr/lib64/libnss3.so 
       
 
        sudo    
        48430 
        root  mem    REG                
        8
        ,
        2    
        188072    
        1575925 
        /usr/lib64/libsmime3.so 
       
 
        sudo    
        48430 
        root  mem    REG                
        8
        ,
        2    
        220200    
        1587191 
        /usr/lib64/libssl3.so 
       
 
        sudo    
        48430 
        root  mem    REG                
        8
        ,
        2    
        113952    
        1442182 
        /lib64/libresolv-
        2.12
        .so 
       
 
        sudo    
        48430 
        root  mem    REG                
        8
        ,
        2     
        43392    
        1442173 
        /lib64/libcrypt-
        2.12
        .so 
       
 
        sudo    
        48430 
        root  mem    REG                
        8
        ,
        2     
        63304    
        1442180 
        /lib64/liblber-
        2.4
        .so.
        2.5
        .
        6 
       
 
        sudo    
        48430 
        root  mem    REG                
        8
        ,
        2   
        1979000    
        1442169 
        /lib64/libc-
        2.12
        .so 
       
 
        sudo    
        48430 
        root  mem    REG                
        8
        ,
        2    
        308912    
        1442181 
        /lib64/libldap-
        2.4
        .so.
        2.5
        .
        6 
       
 
        sudo    
        48430 
        root  mem    REG                
        8
        ,
        2     
        22536    
        1442171 
        /lib64/libdl-
        2.12
        .so 
       
 
        sudo    
        48430 
        root  mem    REG                
        8
        ,
        2     
        58480    
        1442174 
        /lib64/libpam.so.
        0.82
        .
        2 
       
 
        sudo    
        48430 
        root  mem    REG                
        8
        ,
        2     
        17520    
        1441884 
        /lib64/libutil-
        2.12
        .so 
       
 
        sudo    
        48430 
        root  mem    REG                
        8
        ,
        2    
        124624    
        1441798 
        /lib64/libselinux.so.
        1 
       
 
        sudo    
        48430 
        root  mem    REG                
        8
        ,
        2     
        99112    
        1442170 
        /lib64/libaudit.so.
        1.0
        .
        0 
       
 
        sudo    
        48430 
        root  mem    REG                
        8
        ,
        2    
        156872    
        1442168 
        /lib64/ld-
        2.12
        .so 
       
 
        sudo    
        48430 
        root    0r   CHR                
        1
        ,
        3       
        0t0       
        3916 
        /dev/
        null 
       
 
        sudo    
        48430 
        root    1w  FIFO                
        0
        ,
        8       
        0t0 
        1429910151 
        pipe 
       
 
        sudo    
        48430 
        root    2w   REG                
        8
        ,
        3 
        376639626     
        524292 
        /apps/logs/zabbix/zabbix_agentd.log 
       
 
        sudo    
        48430 
        root    3u  sock                
        0
        ,
        6       
        0t0 
        1429910161 
        can't identify protocol 
       
 
        sudo    
        48430 
        root    4r   REG                
        8
        ,
        2       
        764    
        2240617 
        /etc/group 
       
 
        sudo    
        48430 
        root    5u  unix 
        0xffff880179ee4680       
        0t0 
        1429910162 
        socket 
       
 
    

   
 

这里通过查看/proc/pid/fd下的文件描述符的状态，发现这个fd其实是已经关闭的。

这里就有可能是子进程已经运行完成，而父进程没有正确处理子进程的返回信息导致父进程一直认为子进程还在运行，最终产生了僵尸进程。

这其实是sudo的一个bug，相关的bug id :

http://www.gratisoft.us/bugzilla/show_bug.cgi?id=447

关于bug的描述：

 
        If the parent process gets re-scheduled after the “
        if
        ” was executed, and at 
        this 
        very  
       
        time the child process finishes and SIGCHLD is sent to the parent process, sudo gets 
       
        in trouble. The SIGCHLD handler accounts in the variable “recvsig[]” that the signal
       
        was received, and then the parent process calls select(). This select will never be  
       
        interrupted, as the author had it in mind. In 
        99
        % of the cases, the parent process  
       
        will enter in the select() blocking state before the child process ended.  
       
        The child would then send SIGCHLD, which will be accounted in the handler procedure,  
       
        and will also interrupt select() which will 
        return 
        -
        1 
        in “nready”, and “errno” 
       
        will be set to EINTR.

问题出在sudo的代码sudo/file/tip/src/exec.c，小于 1.7.5或1.8.0 之前的版本都有问题，当子进程恰好在select()这个系统调用前退出的时候，句柄已经被退出，所以sudo会卡在select这里

patch:

http://www.sudo.ws/repos/sudo/rev/99adc5ea7f0a

 
        Avoid a potential race condition 
        if 
        SIGCHLD is received immediately before we call select().

网上有人遇到了同样地问题：

http://blog.famzah.net/2010/11/01/sudo-hangs-and-leaves-the-executed-program-as-zombie/

回过头总结，这个问题其实是多个潜在问题同时造成：

1.zabbix agent的自定义监控配置中使用了sudo，导致僵尸进程的父进程不能正常关闭(如果需要sudo，写在脚本里面即可)

2.sudo的bug导致产生了僵死进程（升级sudo即可）

3.zabbix agent端的实现也有问题，某一个进程成为僵死进程后会影响其他的监控项获取（zabbix agent的进程被阻塞导致）

本文转自菜菜光 51CTO博客，原文链接：http://blog.51cto.com/caiguangguang/1566952，如需转载请自行联系原作者

sudo bug导致的zabbix断图问题

热门文章

最新文章

相关电子书

相关实验场景