In this paper, a novel generative adversarial imitation learning (GAIL)-powered policy learning approach is proposed for optimizing beamforming, spectrum allocation, and remote user equipment (RUE) association in non-terrestrial networks (NTNs). Traditional reinforcement learning (RL) methods for wireless network optimization often rely on manually designed reward functions, which can require extensive parameter tuning. To overcome this limitation, we employ inverse RL (IRL), specifically the GAIL framework, to learn reward functions automatically, without manual design. We augment this framework with an asynchronous federated learning approach, enabling decentralized multi-satellite systems to collaboratively derive optimal policies. The proposed method aims to maximize spectrum efficiency (SE) while meeting minimum information-rate requirements for RUEs. To address the non-convex, NP-hard nature of this problem, we combine many-to-one matching theory with a multi-agent asynchronous federated IRL (MA-AFIRL) framework, which allows agents to learn through asynchronous environmental interactions and thereby improves training efficiency and scalability. The expert policy is generated with the whale optimization algorithm (WOA), providing the demonstration data used to train the reward function within GAIL. Simulation results show that the proposed MA-AFIRL method outperforms traditional RL approaches, achieving a $14.6\%$ improvement in convergence and reward value. The proposed GAIL-driven policy learning thus establishes a new benchmark for 6G NTN optimization.
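For reference, a generic statement of the constrained objective described above is the following; the symbols are hypothetical, since the abstract does not give the exact formulation:
\[
\max_{\mathbf{w},\,\boldsymbol{\rho},\,\mathbf{x}} \;\; \mathrm{SE} \;=\; \sum_{k} \log_{2}\!\left(1 + \gamma_{k}\right)
\quad \text{s.t.} \quad R_{k} \ge R_{\min}, \;\; \forall k,
\]
where $\gamma_{k}$ denotes the SINR of RUE $k$, jointly determined by the beamforming vectors $\mathbf{w}$, the spectrum-allocation variables $\boldsymbol{\rho}$, and the RUE--satellite association variables $\mathbf{x}$; $R_{k}$ is the achieved information rate and $R_{\min}$ the minimum-rate requirement. Binary association variables of this kind are a typical source of the non-convexity and NP-hardness the abstract mentions, which motivates handling the association step with matching theory.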
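For context, GAIL-based reward learning of the kind employed here is conventionally formulated as the minimax game of Ho and Ermon; a standard form (with generic notation, not taken from the paper) is:
\[
\min_{\pi} \max_{D} \;\; \mathbb{E}_{\pi}\!\left[\log D(s,a)\right] + \mathbb{E}_{\pi_{E}}\!\left[\log\!\left(1 - D(s,a)\right)\right] - \lambda H(\pi),
\]
where $\pi_{E}$ is the expert policy (here, produced by the WOA), $D$ is a discriminator trained to distinguish expert state-action pairs from those of the learner, $-\log D(s,a)$ serves as the learned reward signal for the policy update, and $H(\pi)$ is a causal-entropy regularizer weighted by $\lambda$. In this way the discriminator plays the role of the automatically learned reward function, replacing manual reward design.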