AI coding agents can resolve real-world software issues, yet they frequently introduce regressions, breaking tests that previously passed. Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under-studied. This paper presents TDAD (Test-Driven Agentic Development), an open-source tool and benchmark methodology that combines abstract-syntax-tree (AST) based code-test graph construction with weighted impact analysis to surface the tests most likely affected by a proposed change. Evaluated on SWE-bench Verified with two local models (Qwen3-Coder 30B on 100 instances and Qwen3.5-35B-A3B on 25 instances), TDAD's GraphRAG workflow reduced test-level regressions by 70% (6.08% to 1.82%) and improved resolution from 24% to 32% when deployed as an agent skill. A surprising finding is that TDD prompting alone increased regressions (9.94%), revealing that smaller models benefit more from contextual information (which tests to verify) than from procedural instructions (how to do TDD). An autonomous auto-improvement loop raised resolution from 12% to 60% on a 10-instance subset with 0% regression. These findings suggest that for AI agent tool design, surfacing contextual information outperforms prescribing procedural workflows. All code, data, and logs are publicly available at https://github.com/pepealonso95/TDAD.
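To make the idea of an AST-based code-test graph with weighted impact analysis concrete, the following is a minimal sketch under stated assumptions and is not TDAD's actual implementation. It parses sources with Python's standard `ast` module, links functions by called name only, treats any function whose name starts with `test_` as a test, and scores each test by its shortest call-path distance to a changed function. The helper names, the `decay` and `max_depth` parameters, and the toy `SOURCES` are all hypothetical.

```python
import ast

def called_names(func_node):
    """Return the simple names called anywhere inside a function body."""
    names = set()
    for node in ast.walk(func_node):
        if isinstance(node, ast.Call):
            target = node.func
            if isinstance(target, ast.Name):
                names.add(target.id)        # plain call: foo(...)
            elif isinstance(target, ast.Attribute):
                names.add(target.attr)      # attribute call: obj.foo(...)
    return names

def build_call_graph(sources):
    """Map each function name to the names it calls (name-based, no scoping)."""
    graph = {}
    for source in sources.values():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                graph[node.name] = called_names(node)
    return graph

def rank_impacted_tests(graph, changed, decay=0.5, max_depth=3):
    """Score tests by call-path distance to a changed function.

    Assumed weighting: a direct call scores 1.0 and each extra hop
    multiplies the score by `decay`. Unreachable tests are omitted.
    """
    scores = {}
    for name in graph:
        if not name.startswith("test_"):
            continue
        frontier, seen, weight = {name}, {name}, 1.0
        for _ in range(max_depth):
            # Expand one hop outward along call edges (breadth-first).
            frontier = {c for f in frontier for c in graph.get(f, ())} - seen
            if not frontier:
                break
            if frontier & changed:
                scores[name] = weight
                break
            seen |= frontier
            weight *= decay
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical toy project: one module and its tests.
SOURCES = {
    "pricing.py": '''
def apply_discount(price, rate):
    return price * (1 - rate)

def total(prices, rate):
    return sum(apply_discount(p, rate) for p in prices)

def format_price(value):
    return str(value)
''',
    "test_pricing.py": '''
def test_apply_discount():
    assert apply_discount(100, 0.5) == 50

def test_total():
    assert total([100, 100], 0.5) == 100

def test_format_price():
    assert format_price(1) == "1"
''',
}

graph = build_call_graph(SOURCES)
ranked = rank_impacted_tests(graph, changed={"apply_discount"})
print(ranked)
```

In this toy project, a change to `apply_discount` surfaces `test_apply_discount` (direct call) ahead of `test_total` (one hop away) and leaves `test_format_price` out entirely. A ranked list of this kind is the sort of contextual information ("which tests to verify") that the abstract contrasts with procedural TDD instructions. Name-based matching is deliberately crude. A real tool would need to resolve imports, methods, and scopes.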