Reinforcement learning (RL) has been widely adopted for controlling and optimizing complex engineering systems such as next-generation wireless networks. An important challenge in adopting RL is the need for direct access to the physical environment. This limitation is particularly severe in multi-agent systems, for which conventional multi-agent reinforcement learning (MARL) requires a large number of coordinated online interactions with the environment during training. When only offline data is available, a direct application of online MARL schemes would generally fail due to the epistemic uncertainty entailed by the lack of exploration during training. In this work, we propose an offline MARL scheme that integrates distributional RL and conservative Q-learning to address the environment's inherent aleatoric uncertainty and the epistemic uncertainty arising from the use of offline data. We explore both independent and joint learning strategies. The proposed MARL scheme, referred to as multi-agent conservative quantile regression, addresses general risk-sensitive design criteria and is applied to the trajectory planning problem in drone networks, showcasing its advantages.
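The combination named above can be illustrated for a single agent. The following is a minimal sketch, not the paper's algorithm: it pairs a QR-DQN-style quantile Huber loss (distributional RL) with a CQL-style conservative penalty that pushes down Q-values of out-of-dataset actions. All names (`N_QUANTILES`, `cql_alpha`, the toy data) are illustrative assumptions.

```python
import numpy as np

# Number of quantiles used to represent the return distribution (assumed).
N_QUANTILES = 8
# Quantile midpoints tau_i = (2i - 1) / (2N), as in quantile-regression RL.
TAUS = (2 * np.arange(N_QUANTILES) + 1) / (2 * N_QUANTILES)

def quantile_huber_loss(pred_quantiles, target_samples, kappa=1.0):
    """Asymmetric quantile Huber loss between predicted quantiles and target samples."""
    # Pairwise TD errors: target_j - pred_i, shape (N_pred, N_target).
    u = target_samples[None, :] - pred_quantiles[:, None]
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    # Quantile weighting |tau - 1{u < 0}| makes the loss asymmetric per quantile.
    weight = np.abs(TAUS[:, None] - (u < 0).astype(float))
    return (weight * huber / kappa).mean()

def cql_penalty(q_all_actions, q_data_action):
    """CQL-style regularizer: log-sum-exp over all actions minus the dataset action's Q.

    Always >= 0; minimizing it keeps Q-values of unseen actions conservative.
    """
    return np.log(np.sum(np.exp(q_all_actions))) - q_data_action

# Toy example: 4 actions; the scalar Q-value is the mean of an action's quantiles.
rng = np.random.default_rng(0)
quantiles = rng.normal(size=(4, N_QUANTILES))    # per-action quantile estimates
data_action = 2                                  # action observed in the offline dataset
target = rng.normal(loc=1.0, size=N_QUANTILES)   # bootstrapped target return samples

td_loss = quantile_huber_loss(quantiles[data_action], target)
penalty = cql_penalty(quantiles.mean(axis=1), quantiles[data_action].mean())
cql_alpha = 1.0  # penalty weight (assumed hyperparameter)
total_loss = td_loss + cql_alpha * penalty
```

In the multi-agent setting of the abstract, each agent would maintain such a quantile critic either independently or over a joint action space; a risk-sensitive criterion can then be obtained by reweighting the quantiles rather than taking their plain mean.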