Evaluating Large Language Models (LLMs) on repository-level feature implementation is a critical frontier in software engineering. However, establishing a benchmark that faithfully mirrors realistic development scenarios remains a significant challenge. Existing feature-level benchmarks generally suffer from two primary limitations: unrealistic task inputs enriched with code hints, and significant data leakage risks due to their static nature. To address these limitations, we propose a new benchmark, FeatBench, which introduces the following advances: (1) Realistic Task Inputs. Task inputs consist solely of natural language requirements and are strictly devoid of code hints (e.g., function signatures). This format mirrors realistic software development by requiring agents to independently bridge the gap between abstract user intent and concrete code changes. (2) Evolving Data. FeatBench employs a fully automated pipeline to construct new benchmark versions from the latest repositories, effectively mitigating data contamination. The initial release comprises 157 tasks sourced from 27 actively maintained repositories. We evaluate two state-of-the-art agent frameworks with four leading LLMs on FeatBench. The results reveal that FeatBench poses a significant challenge, with the highest resolved rate reaching only 29.94%. Crucially, our analysis uncovers a prevalent behavioral pattern of aggressive implementation, which leads to "scope creep" and widespread regressions in which agents break existing features by diverging from the user's explicit intent. We release FeatBench, our automated pipeline, and all experimental results to facilitate further community research.