Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces. However, existing benchmarks, limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, fail to rigorously evaluate the long-horizon planning and execution capabilities essential for realistic software engineering. To address these gaps, we introduce LongCLI-Bench, a comprehensive benchmark designed to evaluate agentic capabilities on long-horizon, realistic tasks. We curated 20 high-quality, long-horizon tasks from over 1,000 computer science assignments and real-world workflows, covering four engineering categories: building from scratch, feature addition, bug fixing, and refactoring. We propose a dual-set testing protocol for LongCLI-Bench that measures requirement fulfillment (fail-to-pass) and regression avoidance (pass-to-pass), and incorporates step-level scoring to pinpoint execution failures. Extensive experiments reveal that even state-of-the-art agents achieve pass rates below 20% on LongCLI-Bench. Step-level analysis further indicates that the majority of tasks stall at less than 30% completion, showing that critical failures often occur in the early stages. Although self-correction offers marginal gains, human-agent collaboration through plan injection and interactive guidance yields substantially larger improvements. These results highlight that future research must emphasize the development of synergistic human-agent workflows alongside advances in agents' planning and execution capabilities to overcome key challenges in long-horizon task performance.
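The dual-set criterion above can be sketched as follows. This is a minimal illustrative sketch, not the benchmark's actual harness: a task counts as resolved only if every fail-to-pass test now passes (the new requirements are met) and every pass-to-pass test still passes (no regressions). All function and variable names here are hypothetical.

```python
def task_resolved(fail_to_pass: dict[str, bool], pass_to_pass: dict[str, bool]) -> bool:
    """A task is resolved only if all F2P tests now pass (requirement
    fulfillment) AND all P2P tests still pass (regression avoidance)."""
    return all(fail_to_pass.values()) and all(pass_to_pass.values())

def pass_rate(results: list[tuple[dict, dict]]) -> float:
    """Fraction of tasks resolved under the dual-set criterion."""
    if not results:
        return 0.0
    resolved = sum(task_resolved(f2p, p2p) for f2p, p2p in results)
    return resolved / len(results)
```

Under this conjunctive criterion, fixing the target behavior while breaking even one previously passing test marks the task as failed, which is what makes the metric stricter than requirement fulfillment alone.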