TestMap: Evidence Infrastructure for Foundation-Model-Assisted Test Generation

Foundation models (FMs) can generate plausible unit tests, but determining whether those tests are correct, useful, maintainable, and worth integrating remains difficult. Generated tests must be mapped to the code they target, inserted into real projects, built, executed, measured against the baseline suite, repaired when necessary, and compared across models and generation strategies. This validation process is fragmented across build systems, test runners, coverage tools, mutation tools, static analyzers, and experiment scripts. The problem is especially important because generated tests are both code artifacts and validation artifacts: they must themselves be validated before they can be trusted as evidence about the system under test. This paper presents TestMap, an open-source infrastructure prototype that automates evidence-backed, foundation-model-assisted test generation for C#/.NET repositories. TestMap supports repository analysis, source-test mapping, baseline execution, code metric collection, test smell detection, coverage measurement, mutation testing, model-guided test generation, validation, repair, and repository-specific experiment tracking. Rather than reporting only the tests that ultimately pass, TestMap records the lifecycle of each generated candidate, including failed, repaired, low-impact, and evidence-positive outcomes. These intermediate outcomes can reveal model limitations, missing context, repair cost, toolchain inefficiencies, or possible faults in the system under test. Using TestMap as a design case, we describe the architecture and evidence model needed to make generated tests observable, repeatable, and comparable across repositories, models, prompts, and generation strategies. We conclude with lessons learned and open challenges, including oracle and assertion quality, metric attribution, test maintainability, flakiness, execution cost, and developer acceptance.
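
To make the idea of a per-candidate lifecycle record concrete, the following is a minimal Python sketch of how one generated test might be recorded and assigned one of the four outcome categories named above. It is not TestMap's actual schema or implementation: the names `CandidateRecord`, `Outcome`, and `classify`, the fields, and the priority order of the labels are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Outcome(Enum):
    """The four outcome categories named in the abstract."""
    FAILED = "failed"
    REPAIRED = "repaired"
    LOW_IMPACT = "low-impact"
    EVIDENCE_POSITIVE = "evidence-positive"


@dataclass
class CandidateRecord:
    """Hypothetical lifecycle record for one generated test candidate."""
    candidate_id: str
    model: str
    passed: bool                   # did the final version build and pass?
    repair_attempts: int = 0       # repair rounds used before the final version
    coverage_delta: float = 0.0    # coverage change vs. baseline (percentage points)
    mutation_delta: float = 0.0    # mutation score change vs. baseline (percentage points)
    stages: List[str] = field(default_factory=list)  # e.g. ["generated", "built", "executed"]


def classify(record: CandidateRecord) -> Outcome:
    """Assign a single label; the priority order here is an assumption."""
    if not record.passed:
        return Outcome.FAILED
    # A passing test that improves neither metric adds little evidence.
    if record.coverage_delta <= 0 and record.mutation_delta <= 0:
        return Outcome.LOW_IMPACT
    # A passing, useful test that needed repair is tracked separately,
    # since the repair rounds represent cost.
    if record.repair_attempts > 0:
        return Outcome.REPAIRED
    return Outcome.EVIDENCE_POSITIVE
```

In this sketch every candidate is kept, whatever its label, so failed and low-impact candidates remain available for later analysis alongside the tests that pass. Treating the four categories as mutually exclusive is a simplification; a real evidence model could equally record repair and impact as independent attributes.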
