LLM-powered coding agents are reshaping the development paradigm. However, existing evaluation systems, whether traditional tests for humans or benchmarks for LLMs, fail to capture this shift. They remain focused on well-defined algorithmic problems, which excludes problems where success depends on human-AI collaboration. Such collaborative problems require human reasoning to interpret complex contexts and guide solution strategies, and they also demand AI efficiency for implementation. To bridge this gap, we introduce HAI-Eval, a unified benchmark designed to measure the synergy of human-AI partnership in coding. HAI-Eval's core innovation is its "Collaboration-Necessary" problem templates, which are intractable for both standalone LLMs and unaided humans but solvable through effective collaboration. Specifically, HAI-Eval uses 45 templates to dynamically create tasks. It also provides a standardized IDE for human participants and a reproducible toolkit with 450 task instances for LLMs, ensuring an ecologically valid evaluation. We conduct a within-subject study with 45 participants and benchmark their performance against 5 state-of-the-art LLMs under 4 different levels of human intervention. Results show that standalone LLMs and unaided participants achieve poor pass rates (0.67% and 18.89%, respectively), whereas human-AI collaboration significantly improves performance to 31.11%. Our analysis reveals an emerging co-reasoning partnership. This finding challenges the traditional human-tool hierarchy by showing that strategic breakthroughs can originate from either humans or AI. HAI-Eval establishes not only a challenging benchmark for next-generation coding agents but also a grounded, scalable framework for assessing core developer competencies in the AI era. Our benchmark and interactive demo will be openly accessible.