While the advancement of large language models has spurred the development of AI agents to automate tasks, numerous use cases inherently require agents to collaborate with humans due to humans' latent preferences, domain expertise, or the need for control. To facilitate the study of human-agent collaboration, we introduce Collaborative Gym (Co-Gym), an open framework for developing and evaluating collaborative agents that engage in bidirectional communication with humans while interacting with task environments. We describe how the framework enables the implementation of new task environments and coordination between humans and agents through a flexible, non-turn-taking interaction paradigm, along with an evaluation suite that assesses both collaboration outcomes and processes. Our framework provides both a simulated condition with a reliable user simulator and a real-world condition with an interactive web application. Initial benchmark experiments across three representative tasks -- creating travel plans, writing related work sections, and analyzing tabular data -- demonstrate the benefits of human-agent collaboration: The best-performing collaborative agents consistently outperform their fully autonomous counterparts in task performance, achieving win rates of 86% in Travel Planning, 74% in Tabular Analysis, and 66% in Related Work when evaluated by real users. Despite these improvements, our evaluation reveals persistent limitations in current language models and agents, with communication and situational awareness failures observed in 65% and 40% of cases in the real condition, respectively. Released under the permissive MIT license, Co-Gym supports the addition of new task environments and can be used to develop collaborative agent applications, while its evaluation suite enables assessment and improvement of collaborative agents.