Understanding the epoll Implementation

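As the earlier article on high-performance network design showed, epoll is a powerful tool for network development. Linux offers three IO multiplexing mechanisms: select, poll, and epoll. Of these, epoll generally performs best and supports the largest number of concurrent connections. This article gives an overview of how epoll is implemented under the hood.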

I. Introductory Example

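At the application level, epoll exposes three basic functions:

epoll_create: creates an epoll object
epoll_ctl: adds a connection to be managed by the epoll object
epoll_wait: waits for IO events on the connections it manages

A simple usage outline looks like this: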

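int main(void) {
    /* lfd is a listening socket; arguments are elided here. */
    listen(lfd, ...);

    /* Accept two client connections. */
    cfd1 = accept(...);
    cfd2 = accept(...);

    /* Create the epoll object. */
    efd = epoll_create(...);

    /* Register both connections with the epoll object. */
    epoll_ctl(efd, EPOLL_CTL_ADD, cfd1, ...);
    epoll_ctl(efd, EPOLL_CTL_ADD, cfd2, ...);

    /* Wait for IO events on the registered connections. */
    epoll_wait(efd, ...);
}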
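The outline above elides every argument, so here is a minimal compilable sketch of the same create, ctl, wait sequence. It rests on a few assumptions that are not in the original: socketpair() stands in for the two accepted connections so no network setup is needed, epoll_create1(0) is used in place of epoll_create, and the function name epoll_flow_demo is purely hypothetical.

```c
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: runs the create -> ctl -> wait flow once.
 * Returns the number of ready descriptors, or -1 on error. */
int epoll_flow_demo(void)
{
    int c1[2], c2[2];   /* c?[0] plays cfd1/cfd2, c?[1] is the remote peer */
    struct epoll_event ev = { .events = EPOLLIN };
    struct epoll_event events[8];
    int efd, ready = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, c1) < 0)
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, c2) < 0) {
        close(c1[0]);
        close(c1[1]);
        return -1;
    }

    efd = epoll_create1(0);
    if (efd >= 0) {
        /* Register both "connections" for readability events. */
        ev.data.fd = c1[0];
        int r1 = epoll_ctl(efd, EPOLL_CTL_ADD, c1[0], &ev);
        ev.data.fd = c2[0];
        int r2 = epoll_ctl(efd, EPOLL_CTL_ADD, c2[0], &ev);

        /* Only the first connection receives data from its peer. */
        if (r1 == 0 && r2 == 0 && write(c1[1], "hi", 2) == 2) {
            ready = epoll_wait(efd, events, 8, 1000);
            for (int i = 0; i < ready; i++)
                printf("fd %d is readable\n", events[i].data.fd);
        }
        close(efd);
    }

    close(c1[0]);
    close(c1[1]);
    close(c2[0]);
    close(c2[1]);
    return ready;
}
```

Because only one of the two registered descriptors has pending data, epoll_wait reports a single ready event even though two descriptors are being watched.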

II. The Key Data Structure: eventpoll

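The three application-level functions need no further explanation here; this section focuses on eventpoll. An eventpoll instance represents one epoll object. A process can have it manage many connection fds, and epoll then monitors them for events and reports the results. All of this work revolves around eventpoll. Its source is as follows: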

/*
 * This structure is stored inside the "private_data" member of the file
 * structure and represents the main data structure for the eventpoll
 * interface.
 */
struct eventpoll {
    /*
     * This mutex is used to ensure that files are not removed
     * while epoll is using them. This is held during the event
     * collection loop, the file cleanup path, the epoll file exit
     * code and the ctl operations.
     */
    struct mutex mtx;

    /* Wait queue used by sys_epoll_wait() */
    wait_queue_head_t wq;

    /* Wait queue used by file->poll() */
    wait_queue_head_t poll_wait;

    /* List of ready file descriptors */
    struct list_head rdllist;

    /* Lock which protects rdllist and ovflist */
    rwlock_t lock;

    /* RB tree root used to store monitored fd structs */
    struct rb_root_cached rbr;

    /*
     * This is a single linked list that chains all the "struct epitem" that
     * happened while transferring ready events to userspace w/out
     * holding ->lock.
     */
    struct epitem *ovflist;

    /* wakeup_source used when ep_scan_ready_list is running */
    struct wakeup_source *ws;

    /* The user that created the eventpoll descriptor */
    struct user_struct *user;

    struct file *file;

    /* used to optimize loop detection check */
    int visited;
    struct list_head visited_list_link;

#ifdef CONFIG_NET_RX_BUSY_POLL
    /* used to track busy poll napi_id */
    unsigned int napi_id;
#endif
};

Three fields of eventpoll deserve attention:

1. struct rb_root_cached rbr;

2. struct list_head rdllist;

3. wait_queue_head_t wq;

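The comments in the source explain their roles: rbr is a red-black tree, rdllist is the ready list, and wq is a wait queue. In terms of epoll's functionality, rbr is a red-black tree that holds the set of events epoll has been asked to monitor. rdllist holds the set of ready events: when an event becomes ready, it is linked onto rdllist so that the application can be notified and handle it. wq tracks the threads blocked in epoll_wait. When epoll_wait runs and no event is ready, the calling thread gives up the CPU and blocks; it is placed on wq and is woken to run again once a ready event arrives.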

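Using a queue for both rdllist and wq is easy to understand: each one handles a stream of pending work, which fits a queue's first-in, first-out behavior. Both are built on the kernel's doubly linked list_head.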
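To make the first-in, first-out behavior concrete, here is a toy user-space model of a ready list. It is only a sketch: the names toy_list, toy_item, and toy_ready_pop are hypothetical, and the code is loosely modeled on the kernel's intrusive list_head rather than copied from it. The idea it illustrates is that newly ready items are appended at the tail while consumers take from the head.

```c
#include <stddef.h>

/* Toy circular doubly linked list node, loosely modeled on list_head. */
struct toy_list {
    struct toy_list *prev, *next;
};

/* Hypothetical stand-in for an epitem: the list node is embedded in it. */
struct toy_item {
    int fd;
    struct toy_list rdllink;
};

void toy_list_init(struct toy_list *head)
{
    head->prev = head->next = head;
}

int toy_list_empty(const struct toy_list *head)
{
    return head->next == head;
}

/* Append at the tail: newly ready items queue up behind older ones. */
void toy_list_add_tail(struct toy_list *node, struct toy_list *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Detach and return the oldest item, or NULL if the list is empty. */
struct toy_item *toy_ready_pop(struct toy_list *head)
{
    if (toy_list_empty(head))
        return NULL;

    struct toy_list *node = head->next;
    node->prev->next = node->next;
    node->next->prev = node->prev;
    node->prev = node->next = node;

    /* Recover the enclosing item from the embedded node (container_of). */
    return (struct toy_item *)((char *)node - offsetof(struct toy_item, rdllink));
}
```

Embedding the link inside the item means that moving an item on or off the list needs no allocation, which is one reason this pattern suits a callback that runs when an event fires.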

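The red-black tree was chosen as the best overall compromise. Consider lookup efficiency first: epoll looks up monitored events in many places, so the structure holding the full event set must support fast lookups. A hashmap would be even faster for lookups, but modification cost and memory overhead also have to be weighed. A red-black tree does not need contiguous memory, whereas a hashmap's bucket array does, and because the number of events registered by the application is not known in advance, sizing that array is a problem. If it is too small, adding entries later forces a new contiguous allocation and a copy; if it is too large, memory is wasted. Modifying the event set is also more convenient with a red-black tree, and there are no hash collisions to deal with. Taken together, the red-black tree is the best fit.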
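As a rough user-space analogy for the interest set, the sketch below uses POSIX tsearch/tfind/tdelete, which glibc happens to implement as a red-black tree (POSIX itself does not require that). The names toy_key, toy_interest_add, and toy_interest_del are hypothetical. The kernel orders its epitems by the (file, fd) pair, and the comparator here mimics that ordering; everything else is an assumption made for illustration.

```c
#include <search.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the lookup key: a (file, fd) pair. */
struct toy_key {
    const void *file;   /* would be a struct file * in the kernel */
    int fd;
};

/* Order by file pointer first, then by fd. */
static int toy_key_cmp(const void *a, const void *b)
{
    const struct toy_key *x = a, *y = b;
    uintptr_t fx = (uintptr_t)x->file, fy = (uintptr_t)y->file;

    if (fx != fy)
        return fx > fy ? 1 : -1;
    return (x->fd > y->fd) - (x->fd < y->fd);
}

/* Returns 0 on insert, -1 if the key is already present (compare EEXIST)
 * or if allocation fails. The tree stores the pointer, so *key must
 * outlive its membership in the tree. */
int toy_interest_add(void **root, const struct toy_key *key)
{
    if (tfind(key, root, toy_key_cmp))
        return -1;
    return tsearch(key, root, toy_key_cmp) ? 0 : -1;
}

/* Returns 0 on removal, -1 if the key was not registered (compare ENOENT). */
int toy_interest_del(void **root, const struct toy_key *key)
{
    return tdelete(key, root, toy_key_cmp) ? 0 : -1;
}
```

Each add performs a lookup before inserting, which mirrors why epoll_ctl needs an efficient search structure: adding a descriptor that is already registered must be detected and rejected.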

III. The Detailed Flow

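This article describes the epoll implementation from both the application side and the kernel side, a "three plus one" view: the implementation of the three application-level interfaces plus one key kernel-level path. (The kernel has more than one function involved; this is simply the one I consider crucial to making epoll work as a whole.)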

Application side: three interfaces, epoll_create, epoll_ctl, and epoll_wait

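1. epoll_create: its main job is to initialize the eventpoll data structure, specifically the rbr, rdllist, and other fields described above, and then return an epoll fd. The application uses this descriptor for all later operations on the epoll object.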

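2. epoll_ctl: operates on epitem nodes according to the requested operation (add, delete, or modify). An epitem represents one monitored descriptor together with the events of interest on it. For EPOLL_CTL_ADD, the most important step is linking the socket to the epoll object: the ep_poll_callback callback is registered on sock->sk_wq, and the epitem's wait entry is placed on socket->wq->wait_queue_head. Later, when the socket becomes readable or writable, this callback runs to notify the eventpoll: it links the epitem onto rdllist and, if a thread is blocked in epoll_wait, wakes that thread from wq.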

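3. epoll_wait: checks the rdllist ready list. If ready events are present, it returns them to the application for handling. Otherwise it places the current thread on eventpoll->wq, gives up the CPU, and blocks.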

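4. When data arrives, it travels from the NIC through the NIC driver to the Linux kernel protocol stack (IP, TCP/UDP, and so on). The stack then calls the socket's wakeup routine (sock_def_readable on the data-ready path; sock_def_wakeup handles state changes), which wakes the epitem's entry on the sock's wait queue and so invokes ep_poll_callback. The callback links the epitem onto eventpoll->rdllist and, if epoll_wait had blocked a thread earlier, wakes that thread as well.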

5. ep_send_events takes the ready events off eventpoll->rdllist and copies them into the application's event array, after which the application handles them.
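The steps above can be observed from user space with a small experiment. This is a sketch under stated assumptions, not a view into the kernel: a socketpair stands in for a network connection, a forked child plays the remote peer, poll(NULL, 0, 50) is used only as a 50 ms sleep, and the function name epoll_wakeup_demo is hypothetical. It checks that epoll_wait reports nothing while the ready list is empty, then blocks, and is woken once the peer writes.

```c
#include <poll.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo. Returns 0 if the observed sequence matches the
 * description in the text, -1 otherwise. */
int epoll_wakeup_demo(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    int ok = 0;
    int efd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = sv[0] };
    struct epoll_event out;

    if (efd >= 0 && epoll_ctl(efd, EPOLL_CTL_ADD, sv[0], &ev) == 0) {
        /* Nothing has arrived yet, so a zero-timeout wait reports 0. */
        int before = epoll_wait(efd, &out, 1, 0);

        pid_t pid = fork();
        if (pid == 0) {
            /* Child: acts as the remote peer and sends data a bit later. */
            poll(NULL, 0, 50);
            ssize_t n = write(sv[1], "x", 1);
            _exit(n == 1 ? 0 : 1);
        }
        if (pid > 0) {
            /* Parent: sleeps inside epoll_wait until the write wakes it. */
            int after = epoll_wait(efd, &out, 1, 2000);
            waitpid(pid, NULL, 0);
            ok = before == 0 && after == 1 &&
                 out.data.fd == sv[0] && (out.events & EPOLLIN);
        }
    }

    if (efd >= 0)
        close(efd);
    close(sv[0]);
    close(sv[1]);
    return ok ? 0 : -1;
}
```

The second epoll_wait call is given a 2000 ms timeout only as a safety net; in the expected case it returns as soon as the child's write makes the socket readable.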

For more, see: 0voice · GitHub
