As large language models (LLMs) evolve into autonomous agents, evaluating repository-level reasoning (the ability to maintain logical consistency across massive, real-world, interdependent file systems) has become critical. Existing benchmarks typically oscillate between isolated code snippets and black-box evaluations. We present RepoReason, a white-box diagnostic benchmark centered on abductive assertion verification. To eliminate memorization effects while preserving authentic logical depth, we implement an execution-driven mutation framework that uses the environment as a semantic oracle to regenerate ground-truth states. Furthermore, we establish a fine-grained diagnostic system based on dynamic program slicing, quantifying reasoning along three orthogonal metrics: $ESV$ (reading load), $MCL$ (simulation depth), and $DFI$ (integration width). Comprehensive evaluations of frontier models (e.g., Claude-4.5-Sonnet, DeepSeek-v3.1-Terminus) reveal a prevalent aggregation deficit, with integration width emerging as the primary cognitive bottleneck. Our findings provide granular, white-box insights for optimizing the next generation of agentic software engineering.
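The abstract does not spell out how the three slicing-based metrics are computed, so the following is only an intuition-building sketch: it assumes a dynamic slice is available as a set of statement nodes with file locations and data dependencies, and it summarizes that slice along the three axes named above. The `SliceNode` representation and the exact counting rules are illustrative assumptions, not the paper's definitions.

```python
# Illustrative sketch only: the node representation and the metric
# definitions below are assumptions chosen to convey the intuition of
# "reading load", "simulation depth", and "integration width".
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceNode:
    """One statement in a dynamic slice (hypothetical representation)."""
    stmt_id: int
    file: str                 # file the statement lives in
    depends_on: tuple = ()    # stmt_ids this statement reads from


def esv(nodes):
    """Reading load: number of distinct statements the model must examine."""
    return len({n.stmt_id for n in nodes})


def mcl(nodes):
    """Simulation depth: length of the longest dependency chain in the slice."""
    by_id = {n.stmt_id: n for n in nodes}

    def depth(node, seen=frozenset()):
        deps = [by_id[d] for d in node.depends_on
                if d in by_id and d not in seen]
        return 1 + max((depth(d, seen | {node.stmt_id}) for d in deps), default=0)

    return max((depth(n) for n in nodes), default=0)


def dfi(nodes):
    """Integration width: number of distinct files contributing to the slice."""
    return len({n.file for n in nodes})


if __name__ == "__main__":
    # Toy slice spanning two files with a three-step dependency chain.
    slice_nodes = [
        SliceNode(1, "utils.py"),
        SliceNode(2, "core.py", depends_on=(1,)),
        SliceNode(3, "core.py", depends_on=(2,)),
    ]
    print(esv(slice_nodes), mcl(slice_nodes), dfi(slice_nodes))  # 3 3 2
```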