LLM-based Mockless Unit Test Generation for Java

Large language models (LLMs) have shown strong potential for automated test generation, yet most approaches to generating Java unit tests still rely on mocking frameworks to handle dependencies. Mockless test generation can exercise more real low-level code, but it faces its own challenges: hallucination that yields invalid test code, strict language constraints, and inadequate dependency awareness. We identify two causes behind these hallucinations: not knowing, where the LLM lacks sufficient context, and not following, where the LLM fails to comply with constraints even when they are provided. We present MocklessTester, a mockless unit test generation approach built around two strategies: context-enriched generation and constraint-enforced fixing. To mitigate not knowing, context-enriched generation mines real usage patterns from existing code and uses them to generate tests. To mitigate not following, constraint-enforced fixing performs two-stage repair under symbol-, protocol-, and iteration-level constraints, using a ClassIndex, a Markov typestate model, and experience memory. We evaluate MocklessTester against the state-of-the-art baseline on Defects4J and Deps4J. On the two benchmarks, respectively, MocklessTester improves line coverage by 19.99% and 22.69%, branch coverage by 24.90% and 15.78%, and mutation score by 13.67% and 0.17%. Beyond the class under test, MocklessTester also exercises more real dependency code, covering 378 and 55 additional lines in dependency classes, respectively. This improvement in test quality comes with higher total token and time costs than the baseline. Nevertheless, the cost per method remains practical, averaging 108.97 seconds and 26.59k tokens on Defects4J, and 69.85 seconds and 25.46k tokens on Deps4J. Ablation results confirm that all major components contribute positively to the final performance.
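To make the protocol-level constraint more concrete, the following is a minimal sketch of what a first-order Markov typestate model over method calls could look like. It is an illustration under stated assumptions, not MocklessTester's actual implementation: the class name `TypestateSketch`, its methods, and the `open`/`read`/`close` call names are all hypothetical. The sketch counts which call directly follows which in mined usage sequences, then flags the first call in a candidate test whose transition was never observed.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a first-order Markov typestate model; all names are assumptions. */
public class TypestateSketch {
    // Pseudo-call that marks the beginning of every sequence.
    static final String START = "<start>";

    // counts.get(prev).get(next) = how often `next` directly followed `prev` in mined usages.
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Record one observed call sequence on a single receiver object. */
    public void observe(List<String> calls) {
        String prev = START;
        for (String call : calls) {
            counts.computeIfAbsent(prev, k -> new HashMap<>()).merge(call, 1, Integer::sum);
            prev = call;
        }
    }

    /** Estimated P(next | prev); 0.0 when the transition was never observed. */
    public double probability(String prev, String next) {
        Map<String, Integer> row = counts.get(prev);
        if (row == null) {
            return 0.0;
        }
        int total = 0;
        for (int c : row.values()) {
            total += c;
        }
        return row.getOrDefault(next, 0) / (double) total;
    }

    /** Index of the first call whose incoming transition was never observed, or -1 if none. */
    public int firstViolation(List<String> calls) {
        String prev = START;
        for (int i = 0; i < calls.size(); i++) {
            if (probability(prev, calls.get(i)) == 0.0) {
                return i;
            }
            prev = calls.get(i);
        }
        return -1;
    }

    public static void main(String[] args) {
        TypestateSketch model = new TypestateSketch();
        // Hypothetical usage sequences mined from existing code.
        model.observe(List.of("open", "read", "close"));
        model.observe(List.of("open", "read", "read", "close"));
        model.observe(List.of("open", "close"));

        // Estimated probability that read() follows open().
        System.out.println(model.probability("open", "read"));
        // A candidate test that calls read() before open() is flagged at its first call.
        System.out.println(model.firstViolation(List.of("read", "close")));
    }
}
```

In a repair loop, a flagged index of this kind could be reported back to the LLM as a protocol-level hint. How MocklessTester combines such a model with the ClassIndex and experience memory is not specified here, so that integration is left out of the sketch.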
