Efficient mobility management and load balancing are critical to sustaining Quality of Service (QoS) in dense, highly dynamic 5G radio access networks. We present a deep reinforcement learning framework based on Proximal Policy Optimization (PPO) for autonomous, QoS-aware load balancing, implemented end-to-end in a lightweight, pure-Python simulation environment. The control problem is formulated as a Markov Decision Process in which the agent periodically adjusts Cell Individual Offset (CIO) values to steer user-cell associations. A multi-objective reward captures key performance indicators (aggregate throughput, latency, jitter, packet loss rate, Jain's fairness index, and handover count), so the learned policy explicitly balances efficiency and stability under user mobility and noisy observations. The PPO agent uses an actor-critic neural network trained on trajectories generated by the Python simulator with configurable mobility models (e.g., Gauss-Markov) and stochastic measurement noise. Across more than 500 training episodes and stress tests with increasing user density, the PPO policy consistently improves KPI trends (higher throughput and fairness; lower latency, jitter, packet loss, and handover count) and exhibits rapid, stable convergence. Comparative evaluations show that PPO outperforms the rule-based ReBuHa and A3 schemes as well as the learning-based CDQL baseline across all KPIs, while maintaining smoother learning dynamics and stronger generalization as load increases. These results indicate that PPO's clipped policy updates and advantage-based training yield robust, deployable control for next-generation RAN load balancing using an entirely Python-based toolchain.
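The two mechanisms the abstract names, CIO-biased user-cell association and a multi-objective KPI reward, can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the weight values and function names are assumed for the example; only Jain's fairness index ((Σx)² / (n·Σx²)) and the RSRP + CIO biasing rule are standard formulations.

```python
def jain_fairness(throughputs):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2).
    Equals 1.0 when all users see identical throughput; approaches
    1/n when one user takes everything."""
    total = sum(throughputs)
    sq = sum(x * x for x in throughputs)
    if sq == 0:
        return 0.0
    return total * total / (len(throughputs) * sq)

def serving_cell(rsrp_dbm, cio_db):
    """A UE attaches to the cell maximizing the biased signal
    RSRP + CIO: raising a cell's CIO attracts traffic to it,
    lowering it sheds load. Hysteresis/time-to-trigger omitted."""
    return max(range(len(rsrp_dbm)), key=lambda c: rsrp_dbm[c] + cio_db[c])

def reward(kpis, weights=None):
    """Weighted multi-objective reward over normalized KPIs in [0, 1].
    Cost-type KPIs (latency, jitter, packet loss, handovers) carry
    negative weights so the agent learns to minimize them. The
    specific weight values here are illustrative placeholders."""
    w = weights or {
        "throughput": 1.0, "fairness": 0.5,
        "latency": -0.5, "jitter": -0.3,
        "packet_loss": -0.5, "handovers": -0.2,
    }
    return sum(w[k] * kpis[k] for k in w)
```

Under this sketch, one control step would read the per-cell RSRP measurements, apply `serving_cell` per UE to resolve associations given the agent's current CIO action, simulate traffic, and score the resulting KPI vector with `reward` to produce the PPO training signal.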