We study the problem of learning multi-task, multi-agent policies for cooperative, temporal objectives under centralized training and decentralized execution. In this setting, representing the tasks assigned to agents as automata makes it possible to decompose a team-level objective into smaller, simpler sub-tasks. However, existing approaches remain sample-inefficient and are limited to the single-task case, so policies must be retrained for each new task. In this work, we present Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning (ACC-MARL), a framework for learning task-conditioned, decentralized team policies. We identify the main challenges to the feasibility of ACC-MARL, propose solutions, and prove that our approach is optimal. We further show that the learned value functions can be used to assign tasks optimally at test time. Experiments demonstrate emergent task-aware, multi-step coordination among agents, such as pressing a button to unlock a door, holding the door, and short-circuiting tasks.
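To make the test-time assignment idea concrete, the following is a minimal sketch, not the procedure from this work. It assumes a team value estimate can be queried for any candidate assignment of one task per agent and picks the highest-valued assignment by exhaustive search. The names `best_assignment`, `toy_team_value`, and `TOY_VALUES` are hypothetical, and the additive per-agent value table is a stand-in for a learned value function.

```python
from itertools import permutations
from typing import Callable, Tuple

# assignment[i] is the index of the task given to agent i
Assignment = Tuple[int, ...]


def best_assignment(
    n_agents: int,
    n_tasks: int,
    team_value: Callable[[Assignment], float],
) -> Assignment:
    """Search all one-task-per-agent assignments and return the best one.

    `team_value` is assumed to wrap a learned value estimate evaluated at
    the initial state; here it is just an arbitrary callable.
    """
    if n_tasks < n_agents:
        raise ValueError("need at least one task per agent")
    # Brute force is factorial in team size; fine for a small illustration.
    return max(permutations(range(n_tasks), n_agents), key=team_value)


# Hypothetical stand-in for learned values: TOY_VALUES[agent][task].
TOY_VALUES = [
    [0.9, 0.2],
    [0.4, 0.8],
]


def toy_team_value(assignment: Assignment) -> float:
    # Assumption for illustration only: team value is the sum of per-agent values.
    return sum(TOY_VALUES[agent][task] for agent, task in enumerate(assignment))


chosen = best_assignment(n_agents=2, n_tasks=2, team_value=toy_team_value)
print(chosen)
```

Exhaustive search keeps the sketch short. For larger teams, a value that really does decompose per agent could instead be handed to a linear assignment solver, but whether such a decomposition holds is an assumption this abstract does not settle.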