Goal-oriented Transmission Scheduling: Structure-guided DRL with a Unified Dual On-policy and Off-policy Approach

Goal-oriented communications prioritize application-driven objectives over data accuracy, enabling intelligent next-generation wireless systems. Efficient scheduling in multi-device, multi-channel systems is challenging because the state and action spaces are high-dimensional. We address this challenge by deriving key structural properties of the optimal solution to the goal-oriented scheduling problem, which incorporates both Age of Information (AoI) and channel states. Specifically, we establish the monotonicity of the optimal state value function (a measure of long-term system performance) with respect to the channel states and prove its asymptotic convexity with respect to the AoI states. We also derive the monotonicity of the optimal policy with respect to the channel states, advancing the theoretical framework for optimal scheduling. Leveraging these insights, we propose structure-guided unified dual on-off policy deep reinforcement learning (SUDO-DRL), a hybrid algorithm that combines the stability of on-policy training with the sample efficiency of off-policy methods. Through a novel structural property evaluation framework, SUDO-DRL enables effective and scalable training in large-scale systems. Numerical results show that SUDO-DRL improves system performance by up to 45% and reduces convergence time by 40% compared to state-of-the-art methods. It also effectively handles scheduling in much larger systems, where off-policy DRL fails and on-policy benchmarks exhibit significant performance loss, demonstrating its scalability and efficacy in goal-oriented communications.
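To make the two value-function properties concrete, the following minimal sketch scores how strongly a candidate value function violates monotonicity in the channel state and convexity in the AoI state at a single sampled state. It is an illustration under stated assumptions, not the evaluation framework used by SUDO-DRL: the function names, the unit step size, the hinge-style scores, the toy value function, and the assumed direction of monotonicity (value treated as a cost that does not increase when a channel improves) are all hypothetical.

```python
import numpy as np


def monotonicity_violation(value_fn, aoi, channel, device, step=1.0,
                           nonincreasing=True):
    """Hinge-style score for one device's channel state.

    Assumption: the value is a cost, so improving a channel should not
    increase it (set nonincreasing=False for the opposite convention).
    Returns 0.0 when the sampled pair of states respects the property.
    """
    better = channel.copy()
    better[device] += step  # hypothetical "better channel" neighbour state
    diff = value_fn(aoi, better) - value_fn(aoi, channel)
    return max(0.0, diff) if nonincreasing else max(0.0, -diff)


def convexity_violation(value_fn, aoi, channel, device, step=1.0):
    """Score based on the second-order difference in one device's AoI.

    A discrete function is convex along this coordinate when
    V(a + step) - 2 V(a) + V(a - step) >= 0; negative values are penalised.
    """
    up, down = aoi.copy(), aoi.copy()
    up[device] += step
    down[device] -= step
    second_diff = (value_fn(up, channel) - 2.0 * value_fn(aoi, channel)
                   + value_fn(down, channel))
    return max(0.0, -second_diff)


def toy_value(aoi, channel):
    """Hypothetical stand-in for a learned critic: convex in AoI and
    decreasing in channel quality."""
    return float(np.sum(aoi ** 2 / channel))


# Example state for two devices (values chosen only for illustration).
aoi = np.array([3.0, 5.0])
channel = np.array([1.0, 2.0])
print(monotonicity_violation(toy_value, aoi, channel, device=0))
print(convexity_violation(toy_value, aoi, channel, device=1))
```

In a training loop, scores of this kind could in principle be averaged over sampled states and added to a critic loss as a penalty; whether and how SUDO-DRL does this is not specified in the abstract.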
