Recent advances in large-scale code generation models have led to remarkable progress in producing high-quality code. These models are trained in a self-supervised manner on extensive unlabeled code corpora using a decoder-only architecture. However, despite their generative strength, decoder-only models often exhibit limited performance on code understanding tasks such as code search and clone detection, primarily due to their generation-oriented training objectives. Training large encoder-only models from scratch on massive code datasets can improve understanding ability, but it remains computationally expensive and time-consuming. In this paper, we explore a more efficient alternative by transferring knowledge from pre-trained decoder-only code generation models to code understanding tasks. We investigate how decoder-only architectures can be effectively adapted to learn discriminative and semantically meaningful code representations. To this end, we propose CL4D, a contrastive learning framework tailored to strengthen the representation capabilities of decoder-only models. Extensive experiments on multiple benchmark datasets demonstrate that CL4D achieves competitive or superior performance compared to existing methods on representative code understanding tasks, including code search and clone detection. Further analysis reveals that CL4D substantially improves the semantic alignment of code representations by reducing the distance between semantically similar code snippets. These findings highlight the feasibility of leveraging decoder-only models as a unified backbone for both code generation and understanding.
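The abstract does not specify CL4D's training objective, but contrastive frameworks of this kind commonly optimize an InfoNCE-style loss that pulls a code snippet's embedding toward a semantically equivalent snippet (the positive) and pushes it away from unrelated ones (the negatives). The sketch below is a hypothetical, minimal illustration of that objective over pre-computed embedding vectors; the function name, temperature value, and the use of cosine similarity are assumptions, not details from the paper.

```python
import numpy as np

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for one (query, positive) pair against a set of negatives.

    query, positive: 1-D embedding vectors of the same dimension.
    negatives: 2-D array of shape (num_negatives, dim).
    A lower loss means the query embedding is closer (in cosine similarity)
    to the positive than to the negatives.
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Positive similarity sits at index 0; negatives follow.
    logits = np.array([cos(query, positive)] +
                      [cos(query, n) for n in negatives]) / temperature
    logits -= logits.max()  # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])

# Toy usage with random "embeddings" (stand-ins for model outputs):
rng = np.random.default_rng(0)
query = rng.normal(size=8)
negatives = rng.normal(size=(4, 8))
aligned = info_nce_loss(query, query, negatives)          # positive == query
random_pos = info_nce_loss(query, rng.normal(size=8), negatives)
```

Minimizing this loss over many (query, positive) pairs is what reduces the distance between semantically similar code snippets, the effect the analysis in the abstract attributes to CL4D.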