2BRobust -- Overcoming TCP BBR Performance Degradation in Virtual Machines under CPU Contention

Motivated by the recent introduction and large-scale deployment of BBR congestion control algorithms, multiple studies have investigated the performance and fairness implications of this shift from loss-based to delay-based congestion control. Given the potential Internet-wide adoption of BBR, we must also consider its robustness in network and system scenarios. One such scenario is cloud-based virtual machine (VM) networking, which is highly relevant in today's CDN-centric Internet. Interestingly, previous work has shown significant performance problems for BBRv1 and BBRv2 running in Xen VMs, with BBR performance dropping to almost zero when CPU credit is low. In this paper, we develop a framework for measuring TCP throughput under fully controlled CPU contention, which uses Linux deadline scheduling to emulate generalized CPU contention conditions. Our measurements reveal that, in stark contrast to Cubic, BBR throughput can break down during CPU contention under any hypervisor and all tested BDP conditions. Characterizing this performance degradation at a fine-grained level, we show that CPU-limited BBR senders are capped at very low throughput levels, below 10-20 Mbps. This finding implies that an Internet-wide shift from Cubic to BBR could harm the Internet's overall robustness if BBR is not deployed with caution. To detect and overcome CPU-limited throughput, we propose a minimal BBR patch that detects the problematic situation by monitoring inflight bytes and reacts by increasing the pacing rate to make better use of the available CPU time. We show that our BBR patch overcomes the throughput problem for the most critical cases.
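
The detect-and-react idea behind the patch can be illustrated with a small user-space sketch. This is not the authors' patch and not kernel code: the abstract only states that the patch monitors inflight bytes and increases the pacing rate. The class and function names, the 50% inflight threshold, the three-round persistence window, and the 2.0 gain are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SenderSample:
    """One per-round snapshot of sender state (hypothetical structure)."""
    inflight_bytes: int       # bytes sent but not yet acknowledged
    bw_estimate_bps: float    # bottleneck bandwidth estimate, bits per second
    min_rtt_s: float          # minimum RTT estimate, seconds
    app_limited: bool = False  # True if the application had no data to send


def bdp_bytes(sample: SenderSample) -> float:
    """Bandwidth-delay product in bytes for the current estimates."""
    return sample.bw_estimate_bps * sample.min_rtt_s / 8.0


class CpuLimitDetector:
    """Flags a sender whose inflight data stays far below the BDP.

    Assumption: persistently low inflight while the application has data
    to send hints that the sender is not getting enough CPU time.
    """

    def __init__(self, inflight_fraction: float = 0.5, rounds_needed: int = 3,
                 base_gain: float = 1.0, boosted_gain: float = 2.0) -> None:
        self.inflight_fraction = inflight_fraction  # assumed threshold
        self.rounds_needed = rounds_needed          # assumed persistence window
        self.base_gain = base_gain
        self.boosted_gain = boosted_gain            # assumed reaction strength
        self.low_rounds = 0

    def update(self, sample: SenderSample) -> float:
        """Consume one sample and return the pacing gain to apply."""
        starved = (not sample.app_limited and
                   sample.inflight_bytes < self.inflight_fraction * bdp_bytes(sample))
        # Count consecutive starved rounds; any healthy round resets the count.
        self.low_rounds = self.low_rounds + 1 if starved else 0
        if self.low_rounds >= self.rounds_needed:
            return self.boosted_gain
        return self.base_gain


def pacing_rate_bps(sample: SenderSample, gain: float) -> float:
    """Pacing rate as gain times the bandwidth estimate."""
    return gain * sample.bw_estimate_bps
```

With a 100 Mbps bandwidth estimate and a 20 ms minimum RTT, the BDP is 250,000 bytes, so a sender that keeps only 50,000 bytes inflight for three consecutive rounds would be paced at twice its bandwidth estimate under these assumed parameters. A single round with healthy inflight resets the detector to the base gain.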
