Efficient mobility management and load balancing are critical to sustaining Quality of Service (QoS) in dense, highly dynamic 5G radio access networks. We present a deep reinforcement learning framework based on Proximal Policy Optimization (PPO) for autonomous, QoS-aware load balancing implemented end-to-end in a lightweight, pure-Python simulation environment. The control problem is formulated as a Markov Decision Process in which the agent periodically adjusts Cell Individual Offset (CIO) values to steer user-cell associations. A multi-objective reward captures key performance indicators (aggregate throughput, latency, jitter, packet loss rate, Jain's fairness index, and handover count), so the learned policy explicitly balances efficiency and stability under user mobility and noisy observations. The PPO agent uses an actor-critic neural network trained from trajectories generated by the Python simulator with configurable mobility (e.g., Gauss-Markov) and stochastic measurement noise. Across 500+ training episodes and stress tests with increasing user density, the PPO policy consistently improves KPI trends (higher throughput and fairness, lower delay, jitter, packet loss, and handovers) and exhibits rapid, stable convergence. Comparative evaluations show that PPO outperforms rule-based ReBuHa and A3 as well as the learning-based CDQL baseline across all KPIs while maintaining smoother learning dynamics and stronger generalization as load increases. These results indicate that PPO's clipped policy updates and advantage-based training yield robust, deployable control for next-generation RAN load balancing using an entirely Python-based toolchain.
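To make the multi-objective reward concrete, the following is a minimal sketch of how Jain's fairness index and a weighted KPI reward might be computed in a pure-Python simulator. The weight values, the `KpiWeights` and `qos_reward` names, and the assumption that all KPIs are pre-normalized to [0, 1] are illustrative and not taken from the paper. Only the Jain's index formula, (sum x)^2 / (n * sum x^2), is standard.

```python
from dataclasses import dataclass
from typing import Sequence


def jain_fairness(throughputs: Sequence[float]) -> float:
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), ranging over [1/n, 1]."""
    n = len(throughputs)
    sum_sq = sum(x * x for x in throughputs)
    if n == 0 or sum_sq == 0.0:
        return 0.0  # convention for this sketch: no users or no traffic
    total = sum(throughputs)
    return (total * total) / (n * sum_sq)


@dataclass(frozen=True)
class KpiWeights:
    # Hypothetical weights; the actual values are not given in the source.
    throughput: float = 1.0
    fairness: float = 1.0
    latency: float = 0.5
    jitter: float = 0.5
    loss: float = 0.5
    handover: float = 0.1


def qos_reward(
    user_throughputs: Sequence[float],  # per-user throughput, normalized to [0, 1]
    latency: float,                     # normalized mean latency
    jitter: float,                      # normalized jitter
    loss_rate: float,                   # packet loss rate in [0, 1]
    handover_rate: float,               # handovers per user in this control step
    w: KpiWeights = KpiWeights(),
) -> float:
    """Weighted sum: reward throughput and fairness, penalize the other KPIs."""
    n = len(user_throughputs)
    mean_tput = sum(user_throughputs) / n if n else 0.0
    return (
        w.throughput * mean_tput
        + w.fairness * jain_fairness(user_throughputs)
        - w.latency * latency
        - w.jitter * jitter
        - w.loss * loss_rate
        - w.handover * handover_rate
    )


# Two load distributions with the same mean throughput: the balanced one
# scores higher because its fairness term is larger.
balanced = qos_reward([0.5, 0.5], 0.2, 0.1, 0.01, 0.0)
skewed = qos_reward([1.0, 0.0], 0.2, 0.1, 0.01, 0.0)
```

A scalar reward of this shape lets a single policy trade efficiency (throughput) against stability (handover count); a real implementation would need to tune the weights and the normalization of each KPI.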