Recent advancements in language models (LMs) have sparked growing interest in developing LM agents. While fully autonomous agents could excel in many scenarios, numerous use cases inherently require them to collaborate with humans because of humans' latent preferences, domain expertise, or need for control. To facilitate the study of human-agent collaboration, we present Collaborative Gym (Co-Gym), a general framework enabling asynchronous, tripartite interaction among agents, humans, and task environments. We instantiate Co-Gym with three representative tasks in both simulated and real-world conditions, and propose an evaluation framework that assesses both the outcomes and the processes of collaboration. Our findings reveal that, in cases where a result is delivered, collaborative agents consistently outperform their fully autonomous counterparts in task performance, achieving win rates of 86% in Travel Planning, 74% in Tabular Analysis, and 66% in Related Work when evaluated by real users. However, our study also highlights significant challenges in developing collaborative agents, which will require advances in core aspects of intelligence: communication capabilities, situational awareness, and the balance between autonomy and human control.