Redis Traffic Statistics: Problem Analysis and Fix

2017-02-13

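Overview: A user recently reported a problem with Redis traffic statistics. This article analyzes how Redis computes its traffic statistics and fixes a counting bug in stock Redis.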

Background

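A user recently reported a problem with Redis traffic statistics: the outbound traffic reported by Redis was larger than what the client actually observed. Our monitoring chart of the Redis outbound traffic collected by the backend is plotted in KB per second, and it shows that the Redis core reports about 10 MB/s of outbound traffic.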

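Our backend Tianxiang (天象) system measures the traffic of every Redis instance at the network protocol stack level. Its chart for the same period shows about 2 MB/s of outbound traffic, which deviates noticeably from the figure reported by Redis.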

How Redis Computes Traffic Statistics

The outbound traffic collected by our backend monitoring is the instantaneous_output_kbps value returned by the info command. It is computed as:

(float)getInstantaneousMetric(STATS_METRIC_NET_OUTPUT)/1024

getInstantaneousMetric is implemented as follows:

/* Return the mean of all the samples. */
long long getInstantaneousMetric(int metric) {
    int j;
    long long sum = 0;

    for (j = 0; j < STATS_METRIC_SAMPLES; j++)
        sum += server.inst_metric[metric].samples[j];
    return sum / STATS_METRIC_SAMPLES;
}

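The outbound traffic is therefore the mean of the samples stored in server.inst_metric for the requested metric type. The samples in server.inst_metric are filled in by trackInstantaneousMetric, implemented as follows: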

/* Add a sample to the operations per second array of samples. */
void trackInstantaneousMetric(int metric, long long current_reading) {
    /* Milliseconds elapsed and counter growth since the previous sample. */
    long long t = mstime() - server.inst_metric[metric].last_sample_time;
    long long ops = current_reading -
                    server.inst_metric[metric].last_sample_count;
    long long ops_sec;

    /* Scale the delta to a per-second rate; guard against t == 0. */
    ops_sec = t > 0 ? (ops*1000/t) : 0;

    /* Store the rate in the ring of samples and advance the write index. */
    server.inst_metric[metric].samples[server.inst_metric[metric].idx] =
        ops_sec;
    server.inst_metric[metric].idx++;
    server.inst_metric[metric].idx %= STATS_METRIC_SAMPLES;
    /* Remember this reading as the baseline for the next sample. */
    server.inst_metric[metric].last_sample_time = mstime();
    server.inst_metric[metric].last_sample_count = current_reading;
}
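
To make the sampling arithmetic concrete, here is a minimal standalone sketch of the two functions above. It is not Redis code: the struct metric_t, the functions metric_track, metric_mean and simulate_steady_rate, and the ring size of 16 are our own illustrative assumptions, and the clock is passed in as a parameter instead of calling mstime() so the result is deterministic.

```c
#include <assert.h>

#define SAMPLES 16  /* assumed ring size, standing in for STATS_METRIC_SAMPLES */

/* Hypothetical, stripped-down stand-in for one server.inst_metric[] entry. */
typedef struct {
    long long last_sample_time;   /* ms timestamp of the previous sample */
    long long last_sample_count;  /* counter value at the previous sample */
    long long samples[SAMPLES];   /* ring of per-second rates */
    int idx;                      /* next slot to overwrite */
} metric_t;

/* Same arithmetic as trackInstantaneousMetric, with the clock passed in. */
void metric_track(metric_t *m, long long now_ms, long long current_reading) {
    long long t = now_ms - m->last_sample_time;
    long long delta = current_reading - m->last_sample_count;

    m->samples[m->idx] = t > 0 ? (delta * 1000 / t) : 0;
    m->idx = (m->idx + 1) % SAMPLES;
    m->last_sample_time = now_ms;
    m->last_sample_count = current_reading;
}

/* Mean of the whole ring, as in getInstantaneousMetric. */
long long metric_mean(const metric_t *m) {
    long long sum = 0;
    for (int j = 0; j < SAMPLES; j++) sum += m->samples[j];
    return sum / SAMPLES;
}

/* Take n samples, one every 100 ms, while the counter grows by
 * bytes_per_tick between samples. Returns the mean rate in bytes/s. */
long long simulate_steady_rate(int n, long long bytes_per_tick) {
    metric_t m = {0};
    long long counter = 0;

    assert(n >= 0);
    for (int i = 1; i <= n; i++) {
        counter += bytes_per_tick;
        metric_track(&m, 100LL * i, counter);
    }
    return metric_mean(&m);
}
```

With a counter that grows by 1024 bytes every 100 ms, simulate_steady_rate(16, 1024) yields 10240 bytes/s, which is 10 after the division by 1024. With only 8 samples taken, half of the ring is still zero and the mean is 5120. The point for the analysis is that the reported rate is only as accurate as the counter being sampled.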

trackInstantaneousMetric is called periodically from serverCron:

run_with_period(100) {
    trackInstantaneousMetric(STATS_METRIC_COMMAND,server.stat_numcommands);
    trackInstantaneousMetric(STATS_METRIC_NET_INPUT,
            server.stat_net_input_bytes);
    trackInstantaneousMetric(STATS_METRIC_NET_OUTPUT,
            server.stat_net_output_bytes);
}
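
The following is a simplified model of how a run_with_period(100) gate can be built on top of a cron function that fires a fixed number of times per second. It reflects our understanding of the idea rather than a quotation of the Redis macro; should_run and runs_per_second are hypothetical names, and hz and cronloops are plain parameters here, not Redis fields.

```c
#include <assert.h>

/* Returns 1 if a task with the given period should run on cron iteration
 * `cronloops`, assuming the cron function fires `hz` times per second. */
int should_run(int period_ms, int hz, long long cronloops) {
    assert(hz > 0 && hz <= 1000);
    int tick_ms = 1000 / hz;               /* time between cron iterations */

    if (period_ms <= tick_ms) return 1;    /* cron is slower: run every time */
    return cronloops % (period_ms / tick_ms) == 0;
}

/* Counts how often the task runs during one second of cron iterations. */
int runs_per_second(int period_ms, int hz) {
    int runs = 0;
    for (long long loop = 0; loop < hz; loop++)
        runs += should_run(period_ms, hz, loop);
    return runs;
}
```

Under this model a 100 ms task runs 10 times per second whether the cron fires 10 or 100 times per second, and only once per second if the cron itself fires once per second. Because trackInstantaneousMetric divides by the elapsed time it actually measures, a slower sampling interval changes the resolution of the samples but not the scale of the rate.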

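As the functions above show, the reported traffic is an average obtained by periodically sampling server.stat_net_output_bytes, so the key to the outbound figure is how server.stat_net_output_bytes itself is updated. The Redis core updates it in the following code: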

/* Return true if the specified client has pending reply buffers to write to
 * the socket. */
int clientHasPendingReplies(client *c) {
    return c->bufpos || listLength(c->reply);
}

/* Write data in output buffers to client. Return C_OK if the client
 * is still valid after the call, C_ERR if it was freed. */
int writeToClient(int fd, client *c, int handler_installed) {
    ssize_t nwritten = 0, totwritten = 0;
    size_t objlen;
    size_t objmem;
    robj *o;

    while(clientHasPendingReplies(c)) {
        if (c->bufpos > 0) {
            nwritten = write(fd,c->buf+c->sentlen,c->bufpos-c->sentlen);
            if (nwritten <= 0) break;
            c->sentlen += nwritten;
            totwritten += nwritten;

            /* If the buffer was sent, set bufpos to zero to continue with
             * the remainder of the reply. */
            if ((int)c->sentlen == c->bufpos) {
                c->bufpos = 0;
                c->sentlen = 0;
            }
        } else {
            o = listNodeValue(listFirst(c->reply));
            objlen = sdslen(o->ptr);
            objmem = getStringObjectSdsUsedMemory(o);

            if (objlen == 0) {
                listDelNode(c->reply,listFirst(c->reply));
                c->reply_bytes -= objmem;
                continue;
            }

            nwritten = write(fd, ((char*)o->ptr)+c->sentlen,objlen-c->sentlen);
            if (nwritten <= 0) break;
            c->sentlen += nwritten;
            totwritten += nwritten;

            /* If we fully sent the object on head go to the next one */
            if (c->sentlen == objlen) {
                listDelNode(c->reply,listFirst(c->reply));
                c->sentlen = 0;
                c->reply_bytes -= objmem;
            }
        }
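        /* Bug analyzed below: totwritten is the running total for this call,
         * so adding it on every iteration counts earlier bytes again. */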
        server.stat_net_output_bytes += totwritten;
        if (totwritten > NET_MAX_WRITES_PER_EVENT &&
            (server.maxmemory == 0 ||
             zmalloc_used_memory() < server.maxmemory)) break;
    }
    if (nwritten == -1) {
        if (errno == EAGAIN) {
            nwritten = 0;
        } else {
            serverLog(LL_VERBOSE,
                "Error writing to client: %s", strerror(errno));
            freeClient(c);
            return C_ERR;
        }
    }
    if (totwritten > 0) {
        /* Data sent to a master does not count as a client interaction. */
        if (!(c->flags & CLIENT_MASTER)) c->lastinteraction = server.unixtime;
    }
    if (!clientHasPendingReplies(c)) {
        c->sentlen = 0;
        if (handler_installed) aeDeleteFileEvent(server.el,c->fd,AE_WRITABLE);
        if (c->flags & CLIENT_CLOSE_AFTER_REPLY) {
            freeClient(c);
            return C_ERR;
        }
    }
    return C_OK;
}

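A close look at the code above shows that totwritten is the cumulative number of bytes written during the current call and is never reset inside the while loop, yet it is added to server.stat_net_output_bytes on every iteration. When the loop runs more than once, each pass re-adds the bytes already counted by earlier passes, so server.stat_net_output_bytes is inflated and the outbound traffic is over-reported. Moving the update of server.stat_net_output_bytes outside the while loop fixes the statistic. After patching the Redis core accordingly, the monitoring chart shows values that are essentially consistent with those collected by Tianxiang.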
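
The over-counting can be reproduced with a small standalone model of the loop. This is a sketch rather than the actual patch: counted_inside_loop and counted_after_loop are hypothetical names, and we assume for simplicity that every write succeeds and transfers the same number of bytes.

```c
#include <assert.h>

/* Both functions model a writeToClient-style loop that performs `nwrites`
 * successful writes of `chunk` bytes each, and return what would be added
 * to the traffic counter. */

/* Counter updated inside the loop, as in the code analyzed above. */
long long counted_inside_loop(int nwrites, long long chunk) {
    long long stat = 0, totwritten = 0;

    assert(nwrites >= 0 && chunk >= 0);
    for (int i = 0; i < nwrites; i++) {
        totwritten += chunk;
        stat += totwritten;   /* re-adds bytes from all earlier passes */
    }
    return stat;
}

/* Counter updated once after the loop, as in the proposed fix. */
long long counted_after_loop(int nwrites, long long chunk) {
    long long stat = 0, totwritten = 0;

    assert(nwrites >= 0 && chunk >= 0);
    for (int i = 0; i < nwrites; i++)
        totwritten += chunk;
    stat += totwritten;       /* each byte is counted exactly once */
    return stat;
}
```

For three writes of 100 bytes, the in-loop version counts 100 + 200 + 300 = 600 bytes while only 300 were sent. With n equal writes the inflation factor is (n + 1) / 2, so a reply drained in nine writes would be over-reported five-fold. That is the same order of magnitude as the gap between the two charts, although the actual write pattern of the affected instance is not known.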

Summary

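Resource limits on the cloud database are not based on server.stat_net_output_bytes, so they are not affected by this miscount in stock Redis. We have submitted a pull request to antirez for the issue and are waiting for upstream to confirm and merge the fix. The Alibaba Cloud Redis team is committed to providing the best cloud Redis service, and we are looking for people who share that goal to join us.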
