Plume: A Framework for High Performance Deep RL Network Controllers via Prioritized Trace Sampling

Deep Reinforcement Learning (DRL) has shown promise in various networking environments. However, these environments present several fundamental challenges for standard DRL techniques: they are difficult to explore and exhibit high levels of noise and uncertainty. Although these challenges complicate the training process, we find that in practice we can substantially mitigate their effects, and even achieve state-of-the-art real-world performance, by addressing a previously overlooked factor: the skewed input trace distribution in DRL training datasets. We introduce a generalized framework, Plume, that automatically identifies and balances this skew using a three-stage process. First, we identify the critical features that determine the behavior of the traces. Second, we classify the traces into clusters. Finally, we prioritize the salient clusters to improve the overall performance of the controller. Plume works seamlessly across DRL algorithms, without requiring any changes to the DRL workflow. We evaluated Plume on three networking environments: Adaptive Bitrate Streaming (ABR), Congestion Control, and Load Balancing. Plume offers superior performance in both simulation and real-world settings, across different controllers and DRL algorithms. For example, our novel ABR controller, Gelato, trained with Plume, has consistently outperformed prior state-of-the-art controllers on the live streaming platform Puffer for over a year. It is the first controller on the platform to deliver statistically significant improvements in both video quality and stalling, decreasing stalls by as much as 75%.
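The three-stage process can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration and is not taken from Plume itself: the choice of features (mean and standard deviation of throughput), the fixed-grid clustering, the equal per-cluster weights standing in for a real prioritization score, and all function names.

```python
import random
import statistics
from collections import defaultdict


def extract_features(trace):
    """Stage 1 (assumed features): mean and std. deviation of throughput."""
    return (statistics.fmean(trace), statistics.pstdev(trace))


def cluster_traces(traces, mean_edges=(2.0, 5.0), std_edge=1.0):
    """Stage 2 (stand-in clustering): bucket traces on a fixed feature grid."""
    clusters = defaultdict(list)
    for idx, trace in enumerate(traces):
        mean, std = extract_features(trace)
        mean_bin = sum(mean >= edge for edge in mean_edges)  # 0, 1, or 2
        std_bin = int(std >= std_edge)                       # 0 or 1
        clusters[(mean_bin, std_bin)].append(idx)
    return dict(clusters)


def sample_trace(clusters, weights, rng):
    """Stage 3: pick a cluster by weight, then a trace uniformly inside it."""
    keys = list(clusters)
    key = rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]
    return rng.choice(clusters[key])


# Skewed toy dataset: 8 high-throughput traces and 2 low-throughput traces.
traces = [[6.0, 6.1, 5.9]] * 8 + [[1.0, 0.5, 1.5]] * 2
clusters = cluster_traces(traces)

# Hypothetical priority: equal weight per cluster, regardless of its size.
weights = {key: 1.0 for key in clusters}

rng = random.Random(0)
picks = [sample_trace(clusters, weights, rng) for _ in range(1000)]
rare_share = sum(i >= 8 for i in picks) / len(picks)
print(f"share of samples drawn from the rare cluster: {rare_share:.2f}")
```

In this toy dataset the low-throughput traces make up 20% of the data but receive roughly half of the training samples, which is the kind of rebalancing the abstract describes. A training loop would call `sample_trace` wherever it currently draws a trace uniformly at random, leaving the rest of the DRL workflow unchanged.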
