Because of reinforcement learning's (RL) ability to automatically learn adaptive control logic beyond hand-crafted heuristics, numerous efforts have been made to apply RL to congestion control (CC) design for real-time video communication (RTC) applications, and they have shown promising benefits over rule-based RTC CCs. Online reinforcement learning is often adopted to train the RL models so that they can directly adapt to real network environments. However, its trial-and-error nature can also cause catastrophic degradation of the quality of experience (QoE) of RTC applications at run time. Thus, safeguard strategies, such as falling back to hand-crafted heuristics, can run alongside RL models to keep the actions explored during training sensible, although these safeguards interrupt the learning process and make it harder to discover optimal RL policies. The recent emergence of loss-tolerant neural video codecs (NVCs) naturally provides a layer of protection for the online learning of RL-based congestion control because of their resilience to packet losses, but this packet loss resilience has not yet been fully exploited in prior work. In this paper, we present an RL-based congestion control that is aware of, and takes advantage of, the packet loss tolerance of NVCs through the reward used in online RL training. Through extensive evaluation on various videos and network traces in a simulated environment, we demonstrate that our NVC-aware CC, running with the loss-tolerant NVC, reduces training time by 41\% compared to prior RL-based CCs. It also boosts mean video quality by 0.3 to 1.6\,dB, lowers tail frame delay by 3 to 200\,ms, and reduces video stalls by 20\% to 77\% in comparison with other baseline RTC CCs.
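To make the reward mechanism concrete, a minimal sketch of what an NVC-aware per-step reward $r_t$ could look like is given below; the specific terms and weights are illustrative assumptions, not necessarily the paper's exact formulation:
\begin{equation*}
r_t \;=\; w_q \, Q_t \;-\; w_d \, D_t \;-\; w_s \, \mathbb{1}[\text{stall}_t],
\end{equation*}
where $Q_t$ is the decoded video quality of frame $t$ (e.g., PSNR measured \emph{after} the NVC's loss concealment), $D_t$ is the frame delay, $\mathbb{1}[\text{stall}_t]$ indicates a video stall, and $w_q$, $w_d$, $w_s$ are non-negative weights. Because $Q_t$ is computed on the decoded output rather than derived from raw packet loss, exploratory actions that drop packets are penalized only when the loss actually degrades QoE, which is what allows a loss-tolerant NVC to shield online training.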