Recent advances in large language models (LLMs) have improved their performance on coding benchmarks. However, these improvements are plateauing as readily available high-quality data is exhausted. Prior work has shown the potential of synthetic self-instruct data, but naively training on a model's own outputs can cause error accumulation. This risk is especially acute in coding tasks, where overly simple or erroneous training data can cause generalization to collapse, highlighting the need for rigorous quality checks on synthetic data. In this work, we explore an effective approach in which the model itself verifies the correctness of its own data. We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capabilities. By iteratively refining code (LLM-as-a-solver) and tests (LLM-as-a-verifier) together, we boost both capabilities without relying on human annotations or larger teacher models. Experiments with the Llama 3.1 8B model demonstrate substantial gains, achieving average relative improvements of 19.63% in code generation and 17.49% in test generation on MBPP and LiveCodeBench.
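The core filtering idea, keeping only self-generated (code, test) pairs that mutually agree, can be illustrated with a minimal sketch. All functions below are hypothetical stand-ins, not the Sol-Ver implementation: in the actual framework, `generate_solution` and `generate_tests` would be LLM calls by the same model acting as solver and verifier.

```python
# Minimal sketch of a self-play solver-verifier filtering round.
# generate_solution / generate_tests are illustrative stand-ins that read
# canned outputs from the problem dict; a real setup would sample an LLM.

def generate_solution(problem):
    # Stand-in "solver": returns candidate code as a string.
    return problem["candidate"]

def generate_tests(problem):
    # Stand-in "verifier": returns a list of assert statements as strings.
    return problem["tests"]

def passes(code, tests):
    # Execute the candidate code, then run each generated test against it.
    env = {}
    try:
        exec(code, env)
        for t in tests:
            exec(t, env)
    except Exception:
        return False
    return True

def self_play_round(problems):
    # Keep only triples where the model's own tests accept the model's own
    # code; these verified triples would seed the next training iteration.
    kept = []
    for p in problems:
        code = generate_solution(p)
        tests = generate_tests(p)
        if passes(code, tests):
            kept.append((p["task"], code, tests))
    return kept

problems = [
    {"task": "add",
     "candidate": "def add(a, b):\n    return a + b",
     "tests": ["assert add(2, 3) == 5"]},
    {"task": "sub",
     "candidate": "def sub(a, b):\n    return a * b",  # buggy solution
     "tests": ["assert sub(5, 3) == 2"]},
]
print([task for task, _, _ in self_play_round(problems)])  # → ['add']
```

Here the buggy `sub` candidate fails its own generated test and is filtered out, so only verified data survives into the next round; iterating this loop improves both the solver and the verifier roles of the single model.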