Improving the code generation capabilities of large language models (LLMs) typically relies on supervised fine-tuning or preference optimization, both of which require costly external resources such as powerful teacher models or reliable unit tests. In real-world scenarios, however, reference solutions and test oracles are much harder to obtain than problem descriptions and test inputs. In this paper, we tackle a challenging yet realistic question: Can a code language model improve itself without access to either a superior teacher or a test oracle? To answer this, we propose ConSelf, a self-improving approach built upon two key ideas. First, we introduce code semantic entropy, a novel metric that measures problem-level uncertainty by assessing the functional diversity of program behaviors, which enables the construction of a curriculum from the most learnable problems. Second, we present consensus-driven direct preference optimization (Con-DPO), a preference-based fine-tuning method that weights each preference pair by its behavioral consensus, thereby mitigating the impact of noisy self-generated supervision. Experiments on various benchmarks and backbone LLMs demonstrate that ConSelf significantly outperforms baselines, validating the effectiveness of semantic entropy-based curriculum construction and consensus-driven optimization in improving code generation without external supervision.
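To make the two ideas concrete, the following is a minimal sketch and not the ConSelf implementation. The abstract does not give formulas, so everything here is an assumption: sampled programs are grouped by their outputs on the available test inputs, code semantic entropy is taken as the Shannon entropy over those behavioral clusters, a sample's consensus is the fraction of samples sharing its behavior, and that consensus scales a standard DPO loss for one preference pair. The names `behavior_signature`, `code_semantic_entropy`, `consensus_scores`, and `weighted_dpo_loss` are hypothetical, and plain Python callables stand in for generated programs that would in practice run in a sandbox.

```python
import math
from collections import Counter
from typing import Any, Callable, List, Sequence, Tuple

Program = Callable[[Any], Any]  # stand-in for a sampled candidate solution


def behavior_signature(program: Program, test_inputs: Sequence[Any]) -> Tuple[str, ...]:
    """Record the program's output (or exception type) on every test input."""
    outputs = []
    for x in test_inputs:
        try:
            outputs.append(repr(program(x)))
        except Exception as exc:  # a crash is also an observable behavior
            outputs.append(f"<{type(exc).__name__}>")
    return tuple(outputs)


def code_semantic_entropy(programs: Sequence[Program], test_inputs: Sequence[Any]) -> float:
    """Assumed definition: Shannon entropy over clusters of identical behavior."""
    clusters = Counter(behavior_signature(p, test_inputs) for p in programs)
    total = sum(clusters.values())
    return -sum((c / total) * math.log(c / total) for c in clusters.values())


def consensus_scores(programs: Sequence[Program], test_inputs: Sequence[Any]) -> List[float]:
    """Assumed definition: share of samples that behave exactly like each sample."""
    signatures = [behavior_signature(p, test_inputs) for p in programs]
    counts = Counter(signatures)
    return [counts[s] / len(signatures) for s in signatures]


def weighted_dpo_loss(policy_margin: float, ref_margin: float,
                      weight: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, scaled by a consensus weight (assumption).

    Each margin is log p(chosen) - log p(rejected) under the respective model.
    """
    z = beta * (policy_margin - ref_margin)
    # -log(sigmoid(z)) written as a numerically stable softplus(-z)
    return weight * (max(-z, 0.0) + math.log1p(math.exp(-abs(z))))


# Three hypothetical samples for "return the absolute value of x".
samples = [lambda x: abs(x), lambda x: x if x >= 0 else -x, lambda x: x]
inputs = [-2, 0, 3]
entropy = code_semantic_entropy(samples, inputs)  # two behavioral clusters
weights = consensus_scores(samples, inputs)       # majority samples score higher
```

Under these assumptions, a problem whose samples all behave identically has zero entropy, and a pair whose chosen program belongs to a large behavioral cluster contributes more to the loss than one backed by a small cluster. How ConSelf actually selects the "most learnable" problems and builds its preference pairs is not stated in the abstract.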