Scaling Test-Driven Code Generation from Functions to Classes: An Empirical Study

Test-driven development (TDD) has been adopted to improve Large Language Model (LLM)-based code generation by using tests as executable specifications. However, existing TDD-style code generation studies are largely limited to function-level tasks, leaving class-level synthesis, where multiple methods interact through shared state and call dependencies, underexplored. In this paper, we scale test-driven code generation from functions to classes via an iterative TDD framework. Our approach first analyzes intra-class method dependencies to derive a feasible generation schedule, and then incrementally implements each method against method-level public tests, using reflection-style execution feedback and a bounded number of repair iterations. To support test-driven generation and rigorous class-level evaluation, we construct ClassEval-TDD, a cleaned and standardized variant of ClassEval with consistent specifications, deterministic test environments, and complete method-level public tests. We conduct an empirical study across eight LLMs and compare against the strongest direct-generation baseline (the best of the holistic, incremental, and compositional strategies). Our class-level TDD framework consistently improves class-level correctness by 12 to 26 absolute points and achieves up to 71% fully correct classes, while requiring only a small number of repairs on average. These results demonstrate that test-driven generation can scale beyond isolated functions and substantially improve the reliability of class-level code generation. All code and data are available at https://anonymous.4open.science/r/ClassEval-TDD-C4C9/
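
The two-stage loop described in the abstract (schedule methods by dependency, then implement each one under public tests with bounded repair) can be illustrated with a minimal sketch. This is not the paper's implementation: the dependency map is assumed to be given, and `generate`, `run_public_tests`, `max_repairs`, and the toy stubs below are hypothetical stand-ins for the LLM call and the test harness.

```python
from graphlib import TopologicalSorter
from typing import Callable, Optional


def generation_schedule(deps: dict[str, set[str]]) -> list[str]:
    """Order methods so that each one comes after the methods it depends on.

    `deps` maps a method name to the sibling methods it calls (assumed input).
    Raises graphlib.CycleError if the dependencies are cyclic.
    """
    return list(TopologicalSorter(deps).static_order())


def implement_with_repair(
    method: str,
    generate: Callable[[str, Optional[str]], str],
    run_public_tests: Callable[[str, str], tuple[bool, Optional[str]]],
    max_repairs: int = 3,
) -> tuple[str, int, bool]:
    """Generate one method, feeding test failures back for at most `max_repairs` repairs.

    Returns (last candidate code, repairs used, whether the public tests passed).
    """
    feedback: Optional[str] = None
    code = ""
    for repairs_used in range(max_repairs + 1):
        code = generate(method, feedback)  # first call has no feedback
        passed, feedback = run_public_tests(method, code)
        if passed:
            return code, repairs_used, True
    return code, max_repairs, False


# Toy stand-ins (hypothetical): the "model" only succeeds once it sees feedback.
def fake_generate(method: str, feedback: Optional[str]) -> str:
    return f"# {method} v2" if feedback else f"# {method} v1"


def fake_run_public_tests(method: str, code: str) -> tuple[bool, Optional[str]]:
    passed = code.endswith("v2")
    return passed, (None if passed else "AssertionError: expected 10, got 0")


deps = {"deposit": {"get_balance"}, "withdraw": {"get_balance"}, "get_balance": set()}
schedule = generation_schedule(deps)
results = {
    m: implement_with_repair(m, fake_generate, fake_run_public_tests)
    for m in schedule
}
```

In this toy run, `get_balance` is scheduled before the two methods that call it, and each method passes after one repair. A real harness would also need a policy for cyclic dependencies and for methods that exhaust their repair budget, neither of which the abstract specifies.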
