Dual-Graph Multi-Agent Reinforcement Learning for Handover Optimization

Matteo Salvatori, Filippo Vannella, Sebastian Macaluso, Stylianos E. Trevlakis, Carlos Segura Perales, José Suarez-Varela, Alexandros-Apostolos A. Boulogeorgos, Ioannis Arapakis

Handover (HO) control in cellular networks is governed by a set of HO control parameters that are traditionally configured through rule-based heuristics. A key parameter for HO optimization is the Cell Individual Offset (CIO), defined for each pair of neighboring cells and used to bias HO triggering decisions. At network scale, tuning CIOs becomes a tightly coupled problem: small changes can redirect mobility flows across multiple neighbors, and static rules often degrade under non-stationary traffic and mobility. We exploit the pairwise structure of CIOs by formulating HO optimization as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) on the network's dual graph. In this representation, each agent controls a neighbor-pair CIO and observes Key Performance Indicators (KPIs) aggregated over its local dual-graph neighborhood, enabling scalable decentralized decisions while preserving graph locality. Building on this formulation, we propose TD3-D-MA, a discrete Multi-Agent Reinforcement Learning (MARL) variant of the TD3 algorithm that uses a shared-parameter Graph Neural Network (GNN) actor operating on the dual graph and region-wise double critics during training to improve credit assignment in dense deployments. We evaluate TD3-D-MA in an ns-3 system-level simulator configured with real-world network operator parameters across heterogeneous traffic regimes and network topologies. Results show that TD3-D-MA improves network throughput over standard HO heuristics and centralized RL baselines, and generalizes robustly under topology and traffic shifts.
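
To make the dual-graph representation concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the dual graph is the line graph of the cell-adjacency graph: one node (agent) per unordered neighbor pair, with two agents linked when their pairs share a cell. The function names `build_dual_graph` and `local_observation`, the toy four-cell topology, the KPI values, and the use of a plain mean as the aggregation are all hypothetical choices made for illustration; the abstract does not specify them.

```python
from collections import defaultdict
from itertools import combinations


def build_dual_graph(cell_pairs):
    """Build a line-graph adjacency: one node per unordered neighbor pair.

    Assumption: two pair-nodes are adjacent when they share a cell.
    """
    # Normalize each pair to a sorted tuple and drop duplicates.
    nodes = sorted({tuple(sorted(pair)) for pair in cell_pairs})

    # Index the pairs incident to each cell.
    incident = defaultdict(list)
    for pair in nodes:
        for cell in pair:
            incident[cell].append(pair)

    # Connect every two pairs that meet at the same cell.
    adjacency = {pair: set() for pair in nodes}
    for pairs in incident.values():
        for a, b in combinations(pairs, 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def local_observation(agent, adjacency, kpi):
    """Average a scalar KPI over the agent and its dual-graph neighbors.

    Assumption: mean aggregation stands in for whatever the paper uses.
    """
    neighborhood = [agent, *sorted(adjacency[agent])]
    return sum(kpi[node] for node in neighborhood) / len(neighborhood)


# Hypothetical chain of four cells: A - B - C - D.
cell_pairs = [("A", "B"), ("B", "C"), ("C", "D")]
dual = build_dual_graph(cell_pairs)

# Hypothetical per-pair KPI values (for example, throughput in Mbps).
kpi = {("A", "B"): 10.0, ("B", "C"): 20.0, ("C", "D"): 30.0}
obs = local_observation(("B", "C"), dual, kpi)
```

In this toy topology the agent for pair (B, C) is adjacent to the agents for (A, B) and (C, D), so its observation depends only on its own pair and those two neighbors, which is the locality property the formulation relies on.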
