Mutual adaptation is a central challenge in human--AI teaming, as humans naturally adjust their strategies in response to a robot's policy. Existing approaches aim to improve the diversity of training partners to approximate human behavior, but these partners are static and fail to capture the adaptive behavior of humans. Exposing robots to adaptive behavior is critical, yet when both agents learn simultaneously in a multi-agent setting, they often converge to opaque implicit coordination strategies that work only with the agents they were co-trained with. Such agents fail to generalize when paired with new partners. To capture the adaptive behavior of humans, we model the human--robot teaming scenario as an Interactive Partially Observable Markov Decision Process (I-POMDP), explicitly representing human adaptation as part of the state. We propose a nested training regime that approximately learns the solution to a finite-level I-POMDP: agents at each level are trained against adaptive agents from the level below. This exposes the ego agent to adaptive behavior during training while avoiding the emergence of implicit coordination strategies, since the training partners are not themselves learning. We train our agent in a multi-episode, required-cooperation setup in the Overcooked domain and compare it against several baseline agents designed for human--robot teaming. We evaluate our agent when paired with adaptive partners that were not seen during training. Our results demonstrate that it not only achieves higher task performance with these adaptive partners but also exhibits significantly greater adaptability during team interactions.
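For concreteness, the control flow of the nested regime can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation; the names (Agent, train_best_response, make_adaptive_partner) and the use of a single agent per level are simplifying assumptions made here. The key point is that partners at level k are frozen, adaptive agents built from level k-1, so the ego agent is exposed to adaptation without any co-learning.

\begin{verbatim}
from dataclasses import dataclass
from typing import List

@dataclass
class Agent:
    level: int

def train_best_response(partners: List["Agent"], level: int) -> "Agent":
    """Placeholder for the inner RL loop (e.g., PPO) that trains a
    new ego agent against the given frozen partners."""
    return Agent(level=level)

def make_adaptive_partner(agent: "Agent") -> "Agent":
    """Placeholder: wrap a trained agent so it adapts across episodes
    (e.g., via a belief over the ego agent's strategy) without
    performing any further gradient updates."""
    return agent

def nested_training(max_level: int) -> "Agent":
    # Level 0 is trained without adaptive partners
    # (e.g., against static or scripted behavior).
    ego = train_best_response(partners=[], level=0)
    for k in range(1, max_level + 1):
        # Freeze the previous level's agent and make it adaptive;
        # because partners no longer learn, no implicit co-adapted
        # conventions can emerge during level-k training.
        partners = [make_adaptive_partner(ego)]
        ego = train_best_response(partners, level=k)
    return ego
\end{verbatim}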