Instead of making behavioral decisions directly from the exponentially expanding joint observation-action space, subtask-based multi-agent reinforcement learning (MARL) methods enable agents to learn how to tackle different subtasks. Most existing subtask-based MARL methods are based on hierarchical reinforcement learning (HRL). However, these approaches often limit the number of subtasks, perform subtask recognition only periodically, and can identify and execute only a specific subtask within a predefined fixed time period, which makes them inflexible and unsuitable for diverse, dynamic scenarios in which subtasks change constantly. To overcome these restrictions, a \textbf{S}liding \textbf{M}ultidimensional t\textbf{A}sk window based m\textbf{U}lti-agent reinforcement learnin\textbf{G} framework (SMAUG) is proposed for adaptive real-time subtask recognition. It leverages a sliding multidimensional task window to extract essential subtask information from trajectory segments formed by concatenating observed and predicted trajectories of varying lengths. An inference network is designed to iteratively predict future trajectories together with the subtask-oriented policy network. Furthermore, intrinsic motivation rewards are defined to promote subtask exploration and behavior diversity. SMAUG can be integrated with any Q-learning-based approach. Experiments on StarCraft II show that SMAUG not only outperforms all baselines but also achieves a more pronounced and rapid rise in rewards during the initial training stage.
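The abstract describes trajectory segments built by concatenating observed and predicted trajectories of varying lengths. The following is a minimal sketch of that concatenation step only, not the authors' implementation: the function name `window_segments`, the `(n_obs, n_pred)` size pairs, and the array layout (one row per time step) are all assumptions made for illustration, and the predicted steps here are placeholder values rather than outputs of an inference network.

```python
import numpy as np

def window_segments(observed, predicted, sizes):
    """Build trajectory segments of several lengths (hypothetical helper).

    observed:  array of shape (T_obs, d), steps seen so far
    predicted: array of shape (T_pred, d), steps produced by some predictor
    sizes:     iterable of (n_obs, n_pred) pairs; each pair keeps the last
               n_obs observed steps and the first n_pred predicted steps
    """
    segments = []
    for n_obs, n_pred in sizes:
        # Clamp so that asking for more history than exists returns all of it.
        past = observed[max(0, len(observed) - n_obs):]
        future = predicted[:n_pred]
        segments.append(np.concatenate([past, future], axis=0))
    return segments

# Toy data: 5 observed steps and 3 placeholder predicted steps, 2 features each.
observed = np.arange(10, dtype=float).reshape(5, 2)
predicted = -np.ones((3, 2))

segs = window_segments(observed, predicted, sizes=[(2, 1), (4, 3), (8, 2)])
shapes = [s.shape for s in segs]
print(shapes)
```

Requesting more steps than are available (the `(8, 2)` pair above) simply yields a shorter segment, which is one simple way to handle the start of an episode; other padding schemes would work equally well under these assumptions.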