Providing a high Quality of Experience (QoE) for video streaming in 5G and beyond 5G (B5G) networks is challenging because the underlying network conditions are dynamic. Several Adaptive Bit Rate (ABR) algorithms have been developed to improve QoE, but most of them are built on fixed rules and are unsuitable for a wide range of network conditions. Recently, Deep Reinforcement Learning (DRL) methods based on Asynchronous Advantage Actor-Critic (A3C) have shown promise in generalizing to diverse network conditions, but they still have limitations. One specific issue with A3C methods is the lag between each actor's behavior policy and the central learner's target policy. Consequently, suboptimal updates arise when the behavior and target policies fall out of synchronization. In this paper, we address the problems faced by vanilla A3C by integrating an on-policy multi-agent DRL method into the existing video streaming framework. Specifically, we propose a novel system for ABR generation: Proximal Policy Optimization-based DRL for Adaptive Bit Rate streaming (PPO-ABR). Our proposed method improves the overall video QoE by maximizing sample efficiency, using a clipped probability ratio between the new and the old policies over multiple epochs of minibatch updates. Experiments on real network traces demonstrate that PPO-ABR outperforms state-of-the-art methods for different QoE variants.
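The clipped probability ratio mentioned above can be illustrated with a minimal sketch, assuming the standard PPO clipped surrogate objective. The function name `clipped_surrogate`, the clipping range `eps=0.2`, and the toy inputs are illustrative assumptions and are not taken from the paper's implementation.

```python
import math


def clipped_surrogate(new_logp, old_logp, advantages, eps=0.2):
    """Mean PPO clipped surrogate objective (a quantity to be maximized).

    new_logp / old_logp: log-probabilities of the chosen actions (for ABR,
    the selected bitrates) under the new and the old policy.
    advantages: advantage estimates for the same samples.
    eps: clipping range; 0.2 is a commonly used value, assumed here.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio between the new and the old policy.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to the interval [1 - eps, 1 + eps].
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        # Take the pessimistic (smaller) of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

Because the clipped term caps how much a single sample can gain from moving the policy away from the old one, the same batch can be reused for several epochs of minibatch updates, which is the source of the sample-efficiency gain the abstract refers to.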