AI code agents excel at isolated tasks yet struggle with complex, multi-file software engineering that requires understanding how dozens of modules relate. We hypothesize these failures stem from an inability to construct, maintain, and update coherent architectural beliefs during codebase exploration. We introduce Theory of Code Space (ToCS), a benchmark that evaluates this capability by placing agents in procedurally generated codebases under partial observability, requiring them to build structured belief states over module dependencies, cross-cutting invariants, and design intent. The framework features: (1) a procedural codebase generator producing medium-complexity Python projects with four typed edge categories reflecting different discovery methods -- from syntactic imports to config-driven dynamic wiring -- with planted architectural constraints and verified ground truth; (2) a partial observability harness where agents explore under a budget; and (3) periodic belief probing via structured JSON, producing a time series of architectural understanding. We decompose the Active-Passive Gap from spatial reasoning benchmarks into selection and decision components, and introduce Architectural Constraint Discovery as a code-specific evaluation dimension. Preliminary experiments with four rule-based baselines and five frontier LLM agents from three providers validate the benchmark's discriminative power: methods span a wide performance range (F1 from 0.129 to 0.646), LLM agents discover semantic edge types invisible to all baselines, yet weaker models score below simple heuristics -- revealing that belief externalization, faithfully serializing internal understanding into structured JSON, is itself a non-trivial capability and a first-order confounder in belief-probing benchmarks. Open-source toolkit: https://github.com/che-shr-cat/tocs
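To make the belief-probing setup concrete, the sketch below shows one plausible shape for a probe response and its scoring: the agent serializes its believed typed dependency edges as JSON, and an edge-level F1 is computed against ground truth. The JSON schema, field names, and edge-type labels here are illustrative assumptions, not the benchmark's actual probe format.

```python
import json

# Hypothetical probe payload: the agent externalizes its current
# architectural beliefs as typed (src, dst, type) dependency edges.
# Field and type names are illustrative, not ToCS's real schema.
belief_json = """
{
  "edges": [
    {"src": "app.routes",  "dst": "app.models", "type": "import"},
    {"src": "app.plugins", "dst": "app.hooks",  "type": "dynamic"}
  ]
}
"""


def edge_f1(believed, truth):
    """F1 over exact (src, dst, type) triples, as one way to score a probe."""
    b = {(e["src"], e["dst"], e["type"]) for e in believed}
    t = {(e["src"], e["dst"], e["type"]) for e in truth}
    if not b or not t:
        return 0.0
    tp = len(b & t)
    precision = tp / len(b)
    recall = tp / len(t)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy ground truth: one believed edge matches, one ground-truth edge is missed.
truth = [
    {"src": "app.routes", "dst": "app.models",  "type": "import"},
    {"src": "app.config", "dst": "app.plugins", "type": "config"},
]
believed = json.loads(belief_json)["edges"]
print(edge_f1(believed, truth))  # -> 0.5
```

Repeating this scoring at each probe yields the time series of architectural understanding described above; a malformed or unfaithful serialization scores zero regardless of the agent's internal understanding, which is exactly the belief-externalization confounder the abstract highlights.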