State-of-the-art methods for Human-AI Teaming and Zero-shot Cooperation use task completion, i.e., task reward, as the sole evaluation metric and are agnostic to how the two agents work with each other. Subjective user studies, in turn, offer only limited insight into the quality of cooperation within the team. We are interested in understanding the cooperative behaviors that arise when trained agents are paired with humans, a problem that the existing literature has overlooked. To address this problem formally, we propose the concept of constructive interdependence, which measures how much agents rely on each other's actions to achieve the shared goal, as a key metric for evaluating cooperation in human-agent teams. We interpret interdependence in terms of action interactions in a STRIPS formalism, and define metrics that assess the degree of reliance between the agents' actions. We pair state-of-the-art HAT agents with learned human models, and with human participants in a user study, in the popular Overcooked domain, and evaluate both the task reward and the teaming performance of these human-agent teams. Our results demonstrate that although trained agents attain high task rewards, they fail to induce cooperative behavior, showing very low levels of interdependence across teams. Furthermore, our analysis reveals that teaming performance is not necessarily correlated with task reward, so task reward alone cannot reliably measure the cooperation arising in a team.
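To make the idea of interdependence over STRIPS-style actions more concrete, the following is a minimal illustrative sketch, not the metric definitions used in this work. It assumes that an executed action counts as interdependent when at least one of its preconditions was most recently made true by the other agent. The `Action` class, the `interdependence_ratio` function, and the Overcooked-like fact names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A STRIPS-style action: preconditions, add effects, delete effects."""
    name: str
    pre: frozenset
    add: frozenset
    delete: frozenset = frozenset()


def interdependence_ratio(trace, initial_state):
    """Fraction of executed actions that relied on another agent.

    Assumption (illustrative only): an action relies on another agent if at
    least one of its preconditions was last established by a different agent.
    `trace` is a list of (agent_id, Action) pairs in execution order.
    """
    # Map each currently true fact to the agent that established it.
    # Facts from the initial state have no provider (None).
    provider = {fact: None for fact in initial_state}
    dependent = 0
    for agent, action in trace:
        missing = [f for f in action.pre if f not in provider]
        if missing:
            raise ValueError(f"{action.name}: unmet preconditions {missing}")
        if any(provider[f] not in (None, agent) for f in action.pre):
            dependent += 1
        for fact in action.delete:
            provider.pop(fact, None)
        for fact in action.add:
            provider[fact] = agent
    return dependent / len(trace) if trace else 0.0


# Hypothetical Overcooked-like hand-off: the agent leaves an onion on the
# counter, and the human picks it up from there to cook and serve.
place = Action("place_onion_on_counter",
               frozenset({"onion_at_dispenser"}),
               frozenset({"onion_on_counter"}))
cook = Action("put_onion_in_pot",
              frozenset({"onion_on_counter"}),
              frozenset({"onion_in_pot"}),
              frozenset({"onion_on_counter"}))
serve = Action("serve_soup",
               frozenset({"onion_in_pot"}),
               frozenset({"soup_served"}),
               frozenset({"onion_in_pot"}))

team_trace = [("ai", place), ("human", cook), ("human", serve)]
solo_trace = [("human", place), ("human", cook), ("human", serve)]
initial = {"onion_at_dispenser"}

team_ratio = interdependence_ratio(team_trace, initial)
solo_ratio = interdependence_ratio(solo_trace, initial)
```

In this toy example both traces reach the same goal fact (`soup_served`), so a task-reward metric would not distinguish them, while the ratio differs: only the hand-off trace contains an action whose precondition was supplied by the other agent.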