AI coding agents powered by large language models can read codebases and produce functional code, but they routinely violate team-specific product decisions that cannot be inferred from the source code alone. We introduce a controlled benchmark measuring decision compliance, the rate at which an AI coding agent follows established product, design, and engineering decisions, across 8 realistic software engineering tasks containing 41 weighted decision points. We compare a baseline configuration (Claude Code with codebase access only) against an augmented configuration that adds Brief, a product-context retrieval system providing spec generation, mid-build consultation, and retrieval of recorded decisions, persona pain points, customer signals, and competitive intelligence. On identical prompts and the same repository, the augmented configuration achieves 95% decision compliance versus 46% for the baseline, a 49-percentage-point improvement. Per-decision analysis shows that the baseline achieves 100% compliance on decisions visible in the codebase but only 0% to 33% on decisions requiring product context, suggesting that product-context retrieval is a key driver of the improvement. We release the benchmark repository, all 16 pull requests, and the scoring harness for independent reproduction.
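As a rough illustration of how a weighted decision-compliance score could be computed, the sketch below assumes the score is the weight of the followed decision points divided by the total weight. The abstract does not spell out the aggregation rule, so this formula, the `DecisionPoint` type, and the toy decision names and weights are all hypothetical and are not taken from the released scoring harness or the benchmark data.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DecisionPoint:
    """One scored decision (hypothetical structure, not the harness's own)."""
    name: str
    weight: float
    followed: bool


def decision_compliance(points: Iterable[DecisionPoint]) -> float:
    """Return followed weight / total weight (assumed aggregation rule)."""
    points = list(points)
    total = sum(p.weight for p in points)
    if total <= 0:
        raise ValueError("need at least one decision point with positive weight")
    followed = sum(p.weight for p in points if p.followed)
    return followed / total


# Toy data for illustration only; these are NOT the benchmark's decisions.
toy_points = [
    DecisionPoint("reuses the existing date helper", 1.0, True),
    DecisionPoint("keeps export behind the feature flag", 2.0, False),
    DecisionPoint("follows the agreed error-message tone", 1.0, True),
]

toy_score = decision_compliance(toy_points)  # (1 + 1) / (1 + 2 + 1)
print(f"toy compliance: {toy_score:.0%}")
```

Under this assumed rule, a heavily weighted decision that is missed pulls the score down more than a lightly weighted one, which is the usual reason for weighting decision points rather than counting them equally.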