Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation and can tackle complex tasks at inference time. However, the extent to which LLMs can be used for code checking or debugging through test case generation remains largely unexplored. We investigate this problem from the perspective of competition-level programming (CP) and propose TCGBench, a Benchmark for (LLM generation of) Test Case Generators. The benchmark comprises two tasks that study the capabilities of LLMs in (1) generating valid test case generators for a given CP problem, and (2) further generating targeted test case generators that expose bugs in human-written code. Experimental results indicate that while state-of-the-art LLMs can generate valid test case generators in most cases, most of them struggle to produce targeted test cases that effectively reveal flaws in human-written code. In particular, even advanced reasoning models (e.g., o3-mini) fall significantly short of human performance on the task of generating targeted generators. Furthermore, we construct a high-quality, manually curated dataset of instructions for generating targeted generators. Analysis demonstrates that this dataset enhances LLM performance through both prompting and fine-tuning.
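For context, a test case generator in the CP setting is a small program that emits inputs conforming to a problem's constraints, typically seeded so a judge can reproduce each case. A minimal sketch in Python, where the toy problem, its bounds, and the seeding convention are illustrative assumptions rather than details from TCGBench:

```python
import random
import sys


def generate(seed: int, max_n: int = 100, max_val: int = 10**9) -> str:
    """Hypothetical generator for a toy problem whose input is
    'n, followed by n integers in [1, max_val]'.

    Deterministic per seed, so the same seed reproduces the same case.
    """
    rng = random.Random(seed)
    n = rng.randint(1, max_n)
    values = [rng.randint(1, max_val) for _ in range(n)]
    return f"{n}\n" + " ".join(map(str, values)) + "\n"


if __name__ == "__main__":
    # Convention (assumed): the seed is passed as the first CLI argument.
    seed = int(sys.argv[1]) if len(sys.argv) > 1 else 0
    sys.stdout.write(generate(seed))
```

A *targeted* generator, by contrast, would bias such sampling toward a specific region of the input space (e.g., boundary values or adversarial structures) chosen to trigger a bug in a particular submission.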