Software Delegation Contracts: Measuring Reviewability in AI Coding-Agent Work

AI coding agents increasingly accept assigned software tasks, modify repositories under bounded authority, and return work packages for review. Prior work proposed the software delegation contract, covering the task, authority, returned work package, and acceptance context, as the unit of analysis for delegated coding work, but did not measure its effects. This paper reports a controlled pilot study of explicit delegation contracts for coding agents. We built a dependency-free TypeScript API task environment with seeded defects and documentation gaps, authored ten tasks across five families, and ran 64 agent executions across two model tiers under three conditions: a realistic issue-style prompt, an explicit delegation contract, and a contract with a required evidence bundle. Each run was scored with hidden acceptance tests, mutation checks, and scope analysis, then reviewed by three independent, condition-blinded, model-based reviewers using a fixed rubric, yielding 192 reviews. Explicit contracts did not improve objective task outcomes: all 64 runs passed the hidden acceptance checks with zero scope violations, leaving no room for improvement on those measures. They did improve reviewability. Evidence sufficiency improved in 22 of 30 paired comparisons and worsened in none (+0.83 on a 5-point scale, p < 0.0001, Cliff's delta = 0.66), and reviewer ambiguity decreased (p = 0.035). Changed-file lists, known-limitations sections, residual-risk sections, and reviewer checklists appeared mostly or only when the contract demanded them. Contracts cost 13% more agent tokens and 38% more wall-clock time, with larger effects for the weaker model tier. On these small tasks, delegation contracts bought reviewability rather than correctness.
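For readers unfamiliar with the effect size reported above, the following is a minimal sketch of Cliff's delta in its standard all-pairs form: the number of cross-group pairs in which the first sample is larger, minus the number in which it is smaller, divided by the total number of pairs. The function name `cliffsDelta` and the example scores are assumptions made for illustration; they are not the study's analysis code or data, and the study's exact computation may differ.

```typescript
// Cliff's delta: (#{x > y} - #{x < y}) / (|xs| * |ys|), over all cross-group pairs.
// Ranges from -1 (xs always smaller) to +1 (xs always larger); ties count as 0.
function cliffsDelta(xs: number[], ys: number[]): number {
  if (xs.length === 0 || ys.length === 0) {
    throw new Error("both samples must be non-empty");
  }
  let greater = 0;
  let less = 0;
  for (const x of xs) {
    for (const y of ys) {
      if (x > y) greater++;
      else if (x < y) less++;
    }
  }
  return (greater - less) / (xs.length * ys.length);
}

// Hypothetical 5-point rubric scores, invented for this example only.
const contractScores = [4, 5, 4, 5];
const baselineScores = [3, 4, 3, 4];
console.log(cliffsDelta(contractScores, baselineScores)); // prints 0.75
```

Because the statistic depends only on the ordering of scores, not their magnitudes, it suits ordinal rubric ratings such as a 5-point evidence-sufficiency scale.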
