Existing network stacks address performance and scalability by relying on multiple receive queues. At the software level, however, each queue is processed by a single thread, which prevents simultaneous work on the same queue and limits performance in terms of tail latency. To overcome this limitation, we introduce COREC, the first software implementation of a concurrent non-blocking single-queue receive driver. Sharing a single queue among multiple threads improves workload distribution and yields a work-conserving policy for network stacks. On the technical side, instead of relying on traditional critical sections, which would serialize the threads' operations, COREC coordinates the threads that concurrently access the same receive queue in a non-blocking manner via atomic machine instructions of the Read-Modify-Write (RMW) class. These instructions let a thread access and update a memory location atomically, subject to a condition such as the location matching a target value selected by the thread. They also make each update globally visible in the memory hierarchy, bypassing the interference with memory consistency caused by CPU store buffers. Extensive evaluation shows that the additional reordering our approach may occasionally cause is non-critical and has minimal impact on performance: even in the worst-case scenario of a single large TCP flow, the performance impairment amounts to at most 2-3 percent. Conversely, substantial latency gains are achieved when handling UDP traffic, real-world traffic mixes, and multiple shorter TCP flows.
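To make the RMW-based coordination described above concrete, the following is a minimal sketch in C11 of how several threads might claim entries of one shared queue with compare-and-swap instead of a lock. It is an illustration only and not COREC's actual algorithm: the names `RING_SIZE`, `MAX_THREADS`, `next_slot`, `times_claimed`, `claim_descriptor`, `worker`, and `run_demo` are all hypothetical, the "queue" is a plain array rather than a NIC descriptor ring, and index wraparound and hardware ownership bits are left out.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE   1024  /* hypothetical number of descriptors in the queue */
#define MAX_THREADS 8     /* hypothetical upper bound on worker threads */

/* Hypothetical shared state: one queue, visible to every worker thread. */
static atomic_size_t next_slot;                /* next unclaimed descriptor */
static atomic_int    times_claimed[RING_SIZE]; /* per-descriptor claim count */

/* Try to claim one descriptor without blocking.
 * Returns its index, or -1 once the queue is exhausted. */
static long claim_descriptor(void)
{
    size_t seen = atomic_load(&next_slot);
    while (seen < RING_SIZE) {
        /* RMW step: the update succeeds only if next_slot still holds the
         * target value this thread selected (seen). On failure, seen is
         * refreshed with the current value and the thread retries. */
        if (atomic_compare_exchange_weak(&next_slot, &seen, seen + 1))
            return (long)seen;
    }
    return -1;
}

static void *worker(void *arg)
{
    (void)arg;
    long idx;
    while ((idx = claim_descriptor()) >= 0)
        atomic_fetch_add(&times_claimed[idx], 1); /* stands in for packet processing */
    return NULL;
}

/* Run up to nthreads workers over the shared queue and return how many
 * descriptors were claimed exactly once. */
int run_demo(int nthreads)
{
    pthread_t tid[MAX_THREADS];
    int started = 0, exactly_once = 0;

    if (nthreads < 1) nthreads = 1;
    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;

    atomic_store(&next_slot, 0);
    for (size_t i = 0; i < RING_SIZE; i++)
        atomic_store(&times_claimed[i], 0);

    for (int i = 0; i < nthreads; i++)
        if (pthread_create(&tid[started], NULL, worker, NULL) == 0)
            started++;
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    for (size_t i = 0; i < RING_SIZE; i++)
        if (atomic_load(&times_claimed[i]) == 1)
            exactly_once++;
    return exactly_once;
}
```

In this sketch, a thread that loses the race on `next_slot` does not wait for the winner; it simply retries with the refreshed index, so no thread ever holds the queue exclusively. The default sequentially consistent ordering of the C11 atomic operations is used here for simplicity; a real driver would likely choose its memory orderings more carefully.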