Test-driven development (TDD) has been adopted to improve Large Language Model (LLM)-based code generation by using tests as executable specifications. However, existing TDD-style code generation studies are largely limited to function-level tasks, leaving class-level synthesis, where multiple methods interact through shared state and call dependencies, underexplored. In this paper, we scale test-driven code generation from functions to classes via an iterative TDD framework. Our approach first analyzes intra-class method dependencies to derive a feasible generation schedule, and then incrementally implements each method against method-level public tests, using reflection-style execution feedback and a bounded number of repair iterations. To support test-driven generation and rigorous class-level evaluation, we construct ClassEval-TDD, a cleaned and standardized variant of ClassEval with consistent specifications, deterministic test environments, and complete method-level public tests. We conduct an empirical study across eight LLMs and compare against the strongest direct-generation baseline (the best of the holistic, incremental, and compositional strategies). Our class-level TDD framework consistently improves class-level correctness by 12 to 26 absolute percentage points and achieves up to 71% fully correct classes, while requiring only a small number of repairs on average. These results demonstrate that test-driven generation can scale beyond isolated functions and substantially improve the reliability of class-level code generation. All code and data are available at https://anonymous.4open.science/r/ClassEval-TDD-C4C9/
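To make the two stages of the approach concrete, the following is a minimal sketch, not the framework's actual implementation. It assumes the intra-class dependencies are already available as a mapping from each method to the methods it calls, and it stands in for the LLM and the test runner with two hypothetical callables, `generate` and `run_tests`. The function names and the default repair budget of 3 are likewise illustrative assumptions.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Optional, Set, Tuple


def generation_schedule(deps: Dict[str, Set[str]]) -> List[str]:
    """Order methods so that each one comes after the methods it depends on.

    `deps` maps a method name to the set of methods it calls (assumed input).
    graphlib raises CycleError if the dependencies are cyclic.
    """
    return list(TopologicalSorter(deps).static_order())


def implement_with_repairs(
    method: str,
    generate: Callable[[str, Optional[str]], str],   # hypothetical LLM call
    run_tests: Callable[[str, str], Optional[str]],  # hypothetical test runner
    max_repairs: int = 3,                            # assumed budget
) -> Tuple[str, int, bool]:
    """Generate one method, then repair it at most `max_repairs` times.

    `run_tests` returns None when all public tests pass, otherwise a
    failure message that is fed back into the next `generate` call.
    Returns (code, repairs_used, passed).
    """
    code = generate(method, None)  # first attempt, no feedback yet
    for repairs in range(max_repairs + 1):
        feedback = run_tests(method, code)
        if feedback is None:
            return code, repairs, True
        if repairs < max_repairs:
            code = generate(method, feedback)  # one repair iteration
    return code, max_repairs, False


if __name__ == "__main__":
    # Toy dependency graph for a hypothetical bank-account class.
    deps = {
        "__init__": set(),
        "get_balance": {"__init__"},
        "withdraw": {"__init__", "get_balance"},
    }
    print(generation_schedule(deps))
```

In this sketch a method is only attempted once everything it calls has been scheduled before it, and the repair loop stops as soon as the public tests pass, so the number of repairs actually used can be recorded per method.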