Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces. However, existing benchmarks, limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, fail to rigorously evaluate the long-horizon planning and execution capabilities essential for realistic software engineering. To address these gaps, we introduce LongCLI-Bench, a comprehensive benchmark designed to evaluate agentic capabilities on long-horizon, realistic tasks. We curated 20 high-quality, long-horizon tasks from over 1,000 computer science assignments and real-world workflows, covering four engineering categories: building from scratch, feature addition, bug fixing, and refactoring. We propose a dual-set testing protocol for LongCLI-Bench, which measures requirement fulfillment (fail-to-pass) and regression avoidance (pass-to-pass), and incorporates step-level scoring to pinpoint execution failures. Extensive experiments reveal that even state-of-the-art agents achieve pass rates below 20% on LongCLI-Bench. Step-level analysis further indicates that most tasks stall at less than 30% completion, showing that critical failures often occur in the early stages. Although self-correction offers marginal gains, human-agent collaboration through plan injection and interactive guidance yields significantly larger improvements. These results indicate that future research must pair advances in agents' planning and execution capabilities with the development of synergistic human-agent workflows to overcome key challenges in long-horizon task performance.
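The dual-set protocol described above can be illustrated with a minimal sketch: a task counts as solved only if every fail-to-pass test (encoding a new requirement) now passes and every pass-to-pass test (guarding existing behavior) still passes. The function and test names here are illustrative assumptions, not the benchmark's actual implementation.

```python
# Illustrative sketch of a dual-set evaluation in the spirit of the
# fail-to-pass / pass-to-pass protocol. All names are hypothetical.

def evaluate_task(fail_to_pass, pass_to_pass, run_test):
    """Return True only if all requirement tests now pass (fail-to-pass)
    and no previously passing test regresses (pass-to-pass)."""
    requirements_met = all(run_test(t) for t in fail_to_pass)
    no_regressions = all(run_test(t) for t in pass_to_pass)
    return requirements_met and no_regressions

# Toy usage: suppose the agent's patch made test "t1" pass but broke "t3".
results = {"t1": True, "t2": True, "t3": False}
run = lambda name: results[name]
print(evaluate_task(["t1"], ["t2", "t3"], run))  # False: "t3" regressed
print(evaluate_task(["t1"], ["t2"], run))        # True: no regression
```

Separating the two test sets is what lets the benchmark distinguish an agent that fulfills new requirements from one that does so while silently breaking existing functionality.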