The rise of reasoning models necessitates large-scale verifiable data, for which programming tasks serve as an ideal source. However, while competitive programming platforms provide abundant problems and solutions, high-quality test cases for verification remain scarce. Existing approaches attempt to synthesize test cases using Large Language Models (LLMs), but they rely solely on the model's intrinsic generation capabilities without external feedback, frequently yielding insufficiently diverse test cases. To address this limitation, we propose a $\textbf{Feedback-Driven Iterative Framework}$ for comprehensive test case construction. Specifically, our method leverages the LLM to generate initial test cases, executes them against known correct and incorrect solutions, and uses the failure results as feedback to guide the LLM in refining the test cases toward high fidelity and discriminability. We then apply this method to the CodeContests dataset to construct an optimized high-quality derivative, $\textbf{CodeContests-O}$. Evaluated against the entire pool of solutions ($1.1 \times 10^7$ in total), our dataset achieves an average True Positive Rate (TPR) of $89.37\%$ and True Negative Rate (TNR) of $90.89\%$, significantly outperforming CodeContests and CodeContests+ by margins of $4.32\%$ and $9.37\%$, respectively. Furthermore, fine-tuning the Qwen2.5-7B model on CodeContests-O yields a $9.52\%$ improvement on LiveCodeBench (Pass@1). These experiments demonstrate the effectiveness of our framework and the quality of CodeContests-O. To support reproducibility and facilitate future research, we release the $\href{https://github.com/cai-jianfeng/CodeContests-O}{code}$ and $\href{https://huggingface.co/datasets/caijanfeng/CodeContests-O}{dataset}$.
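The generate-execute-refine loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `llm_propose` is a hypothetical stub standing in for an LLM call, and a test case is modeled as an (input, expected output) pair. The loop keeps only cases that agree with a known-correct solution (fidelity) and iterates until some case makes a known-incorrect solution fail (discriminability), feeding the failure back to the generator.

```python
# Hedged sketch of a feedback-driven test-case refinement loop.
# All names here are illustrative stubs, not the released framework's API.

def correct_sol(xs):
    """Known-correct solution for a toy task: return the maximum."""
    return max(xs)

def wrong_sol(xs):
    """Known-incorrect solution: returns the first element instead."""
    return xs[0]

def llm_propose(feedback):
    """Stub for the LLM. On the first round it proposes weak cases where
    the maximum happens to be first (so wrong_sol also passes); once
    feedback flags a surviving wrong solution, it proposes a case that
    discriminates between the two solutions."""
    if feedback is None:
        return [([5, 1, 2], 5), ([9, 3], 9)]
    return [([1, 7, 3], 7)]  # maximum is not in the first position

def refine(max_iters=3):
    cases, feedback = [], None
    for _ in range(max_iters):
        cases += llm_propose(feedback)
        # Fidelity: every case must agree with the correct solution.
        cases = [(x, y) for x, y in cases if correct_sol(x) == y]
        # Discriminability: does some case reject the wrong solution?
        if not all(wrong_sol(x) == y for x, y in cases):
            return cases  # the incorrect solution now fails
        feedback = "wrong_sol passes all current cases"
    return cases

final_cases = refine()
```

In a real pipeline the stub would be replaced by an LLM prompted with the problem statement and the concrete failure report, and fidelity/discriminability would be measured over the full pools of accepted and rejected submissions, as the TPR/TNR figures above reflect.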