Owing to network delays and scalability limitations, clustered ad hoc networks widely adopt Reinforcement Learning (RL) for on-demand resource allocation. Despite their demonstrated agility, traditional Model-Free RL (MFRL) solutions struggle to tackle the huge action space, which generally explodes exponentially with the number of resource allocation units, resulting in low sampling efficiency and high interaction cost. In contrast to MFRL, Model-Based RL (MBRL) offers an alternative means of boosting sample efficiency and stabilizing training by explicitly leveraging a learned environment model. However, establishing an accurate dynamics model for complex and noisy environments requires a careful balance between model accuracy on one hand and computational complexity and stability on the other. To address these issues, we propose a Conditional Diffusion Model Planner (CDMP) for high-dimensional offline resource allocation in clustered ad hoc networks. By leveraging the remarkable generative capability of Diffusion Models (DMs), our approach accurately models high-quality environmental dynamics while employing an inverse dynamics model to plan a superior policy. Beyond simply adopting DMs in offline RL, we further equip the CDMP algorithm with a theoretically guaranteed, uncertainty-aware penalty metric, which both theoretically and empirically mitigates the Out-of-Distribution (OOD)-induced distribution shift that arises from scarce training data. Extensive experiments also show that our model outperforms MFRL in average reward and Quality of Service (QoS) while achieving performance comparable to other MBRL algorithms.