Offline Reinforcement Learning (RL) is a promising approach for next-generation wireless networks, where online exploration is unsafe and large amounts of operational data can be reused across the model lifecycle. However, the behavior of offline RL algorithms under genuinely stochastic dynamics -- inherent to wireless systems due to fading, noise, and traffic mobility -- remains insufficiently understood. We address this gap by evaluating Bellman-based (Conservative Q-Learning), sequence-based (Decision Transformers), and hybrid (Critic-Guided Decision Transformers) offline RL methods in an open-source stochastic telecom environment (mobile-env). Our results show that Conservative Q-Learning consistently produces more robust policies across different sources of stochasticity, making it a reliable default choice in lifecycle-driven AI management frameworks. Sequence-based methods remain competitive and can outperform Bellman-based approaches when sufficiently many high-return trajectories are available. These findings provide practical guidance for offline RL algorithm selection in AI-driven network control pipelines, such as O-RAN and future 6G functions, where robustness and data availability are key operational constraints.
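For concreteness, the sketch below shows the conservative regularizer at the core of Conservative Q-Learning in its discrete-action form: the loss combines a standard TD error with a penalty that lowers Q-values over all actions (via a log-sum-exp) while raising them on actions actually present in the offline dataset. This is a minimal illustration of the general CQL objective, not the paper's implementation; the network shape, the coefficient `alpha`, and all variable names are illustrative assumptions.

```python
# Minimal sketch of discrete-action Conservative Q-Learning (CQL).
# Hyperparameters and architecture are illustrative assumptions.
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)  # Q-values for every discrete action


def cql_loss(q_net, target_net, batch, gamma=0.99, alpha=1.0):
    """batch = (obs, actions, rewards, next_obs, dones) from the offline dataset."""
    obs, actions, rewards, next_obs, dones = batch
    q_values = q_net(obs)                                  # shape (B, n_actions)
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Standard TD target computed from logged transitions only.
    with torch.no_grad():
        next_q = target_net(next_obs).max(dim=1).values
        td_target = rewards + gamma * (1.0 - dones) * next_q
    bellman_error = nn.functional.mse_loss(q_taken, td_target)

    # Conservative penalty: push Q down on all actions (logsumexp)
    # and up on the actions observed in the data.
    conservative = (torch.logsumexp(q_values, dim=1) - q_taken).mean()

    return bellman_error + alpha * conservative
```

In practice, `obs_dim` and `n_actions` would be read from the logged dataset or from the environment's observation and action spaces, and `alpha` controls how strongly the learned Q-function stays pessimistic on out-of-distribution actions.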