Condor: A Code Discriminator Integrating General Semantics with Code Details

Large language models (LLMs) show significant potential across a range of software engineering tasks. However, when requirements are complex, they still struggle to generate correct code on the first attempt. Introducing a discriminator that selects reliable outputs from multiple generated candidates is an effective way to improve their reliability and stability. Existing discriminators fall into two categories: execution-based and non-execution-based. Execution-based discriminators are limited in flexibility because test cases are hard to obtain and executing untrusted code raises security concerns. Non-execution-based discriminators are more flexible but struggle to capture subtle differences in code details. To retain flexibility while improving the ability to capture fine-grained code details, this paper proposes Condor. We first use contrastive learning to optimize the base model's code representations so that they reflect differences in code details. We then leverage intermediate data from the code modification process to enrich the discriminator's training data, further strengthening its ability to discern code details. Experimental results show that on CodeNanoFix, a dataset of subtle code differences, Condor significantly outperforms other discriminators: Condor (1.3B) improves the discriminative F1 score of DeepSeek-Coder (1.3B) from 67% to 73%. When discriminating LLM-generated outputs, Condor (1.3B) and Condor (110M) raise the Pass@1 score of Meta-Llama-3.1-Instruct (70B) on CodeNanoFix from 52.64% to 62.63% and 59.64%, respectively. Condor also generalizes well to the APPS, MBPP, and LiveCodeBench datasets. For example, Condor (1.3B) improves the Pass@1 of Meta-Llama-3.1-Instruct (70B) on APPS by 147.05% in relative terms.
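
The selection setting described above can be illustrated with a minimal sketch. It is not Condor's implementation: the scoring function stands in for any non-execution-based discriminator, and the names `select_candidate`, `pass_at_1`, and `relative_improvement` are hypothetical helpers introduced only for this example. The sketch assumes the discriminator returns one scalar score per candidate and that the highest-scoring candidate is submitted as the single answer.

```python
from typing import Callable, Sequence


def select_candidate(candidates: Sequence[str],
                     score: Callable[[str], float]) -> str:
    """Return the candidate that the discriminator scores highest.

    `score` is a stand-in for a trained discriminator (assumption);
    no candidate is executed.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def pass_at_1(selected_is_correct: Sequence[bool]) -> float:
    """Fraction of problems whose single selected output is correct."""
    if not selected_is_correct:
        raise ValueError("need at least one problem")
    return sum(selected_is_correct) / len(selected_is_correct)


def relative_improvement(before: float, after: float) -> float:
    """Relative gain of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Toy usage: made-up scores for two candidate programs.
toy_scores = {"return a - b": 0.12, "return a + b": 0.91}
best = select_candidate(list(toy_scores), toy_scores.get)

# Relative gain for the reported CodeNanoFix numbers (52.64% -> 62.63%).
gain = relative_improvement(52.64, 62.63)
```

Under these assumptions, the reported CodeNanoFix change from 52.64% to 62.63% is 9.99 percentage points in absolute terms, or roughly 19% in relative terms. A gain above 100%, such as the 147.05% reported on APPS, can only be a relative one.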
