Mobile devices such as smartphones, laptops, and tablets can often connect to multiple access networks (e.g., Wi-Fi, LTE, and 5G) simultaneously. Recent advancements allow these connections to be integrated seamlessly below the transport layer, improving the experience for apps that lack inherent multi-path support. This optimization hinges on dynamically determining how each device's traffic is distributed across networks, a process referred to as \textit{multi-access traffic splitting}. This paper introduces \textit{NetworkGym}, a high-fidelity network environment simulator that supports generating multiple network traffic flows and multi-access traffic splitting. The simulator enables training and evaluating different reinforcement learning (RL) solutions for the multi-access traffic splitting problem. Our initial explorations demonstrate that the majority of existing state-of-the-art offline RL algorithms (e.g., CQL) fail to outperform certain hand-crafted heuristic policies on average. This illustrates the urgent need to evaluate offline RL algorithms against a broader range of benchmarks, rather than relying solely on popular ones such as D4RL. We also propose an extension to the TD3+BC algorithm, named Pessimistic TD3 (PTD3), and demonstrate that it outperforms many state-of-the-art offline RL algorithms. PTD3's behavioral constraint mechanism, which relies on value-function pessimism, is theoretically motivated and relatively simple to implement.
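To make the notion of multi-access traffic splitting concrete, the following minimal sketch treats the decision as choosing, for one device, the fraction of traffic to send over each access network. It is purely illustrative: the throughput-proportional rule, the function name `proportional_split`, and the rate estimates are assumptions introduced here, and they are neither part of NetworkGym nor one of the heuristic policies evaluated in the paper.

```python
from typing import Dict


def proportional_split(link_rates_mbps: Dict[str, float]) -> Dict[str, float]:
    """Return the fraction of one device's traffic to send over each link.

    Hypothetical heuristic: split in proportion to each link's estimated
    rate. Rates are assumed to be non-negative.
    """
    if not link_rates_mbps:
        raise ValueError("at least one access network is required")

    total = sum(link_rates_mbps.values())
    if total <= 0:
        # No usable capacity estimate: fall back to an even split.
        even = 1.0 / len(link_rates_mbps)
        return {name: even for name in link_rates_mbps}

    return {name: rate / total for name, rate in link_rates_mbps.items()}


if __name__ == "__main__":
    # Assumed rate estimates for a device attached to Wi-Fi and LTE.
    ratios = proportional_split({"wifi": 30.0, "lte": 10.0})
    for name, fraction in ratios.items():
        print(f"{name}: {fraction:.2f}")
```

An RL-based policy would replace this fixed rule with a learned mapping from observed network state to the split ratios.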