CodeT5+: Open Code Large Language Models for Code Understanding and Generation

Large language models (LLMs) pretrained on vast amounts of source code have made notable progress in code intelligence. However, existing code LLMs have two main limitations, one in architecture and one in pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is inflexible in its applications, while the latter treats the model as a single system for all tasks, leading to suboptimal performance on a subset of tasks. Second, they often employ a limited set of pretraining objectives that may not be relevant to some downstream tasks, resulting in substantial performance degradation. To address these limitations, we propose "CodeT5+", a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. This flexibility is enabled by our proposed mixture of pretraining objectives, which mitigates the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs, rather than training from scratch, to scale up our models efficiently, and we explore instruction tuning to align the models with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction tuning. We observe state-of-the-art (SoTA) model performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval. In particular, our instruction-tuned CodeT5+ 16B achieves new SoTA results on the HumanEval code generation task against other open code LLMs.

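To make the span denoising objective named in the abstract more concrete, the following is a minimal sketch of how a corrupted input and its reconstruction target could be built from a token sequence. It is an illustration only, not the CodeT5+ implementation: the function name `span_corrupt`, the whitespace tokenization, the hand-picked spans, and the T5-style sentinel naming `<extra_id_N>` are all assumptions made for this example.

```python
def span_corrupt(tokens, spans):
    """Mask the given spans and build a (source, target) pair.

    tokens: list of string tokens.
    spans:  list of (start, length) pairs, assumed non-overlapping.
    Each masked span is replaced in the source by one sentinel token,
    and the target lists each sentinel followed by the tokens it hides.
    The sentinel format is an assumption borrowed from T5 conventions.
    """
    source, target = [], []
    cursor = 0
    for i, (start, length) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        source.extend(tokens[cursor:start])          # keep unmasked tokens
        source.append(sentinel)                      # placeholder for the span
        target.append(sentinel)                      # marks which span follows
        target.extend(tokens[start:start + length])  # the hidden tokens
        cursor = start + length
    source.extend(tokens[cursor:])                   # trailing unmasked tokens
    return source, target


# Hypothetical example: mask the function name and the returned expression.
tokens = "def add ( a , b ) : return a + b".split()
src, tgt = span_corrupt(tokens, [(1, 1), (9, 3)])
print(" ".join(src))  # def <extra_id_0> ( a , b ) : return <extra_id_1>
print(" ".join(tgt))  # <extra_id_0> add <extra_id_1> a + b
```

In an encoder-decoder setup, the encoder would read the corrupted source and the decoder would be trained to emit the target. In practice, spans are sampled randomly rather than chosen by hand as they are here.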