CodeT5+: Open Code Large Language Models for Code Understanding and Generation

Large language models (LLMs) pretrained on vast amounts of source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is inflexible in application, while the latter treats the model as a single system for all tasks, leading to suboptimal performance on a subset of tasks. Second, they often employ a limited set of pretraining objectives, which might not be relevant to some downstream tasks and hence result in substantial performance degradation. To address these limitations, we propose "CodeT5+", a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. This flexibility is enabled by our proposed mixture of pretraining objectives, which mitigates the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs, rather than training from scratch, to scale up our models efficiently, and we explore instruction-tuning to align the models with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) model performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval. In particular, our instruction-tuned CodeT5+ 16B achieves new SoTA results on the HumanEval code generation task against other open code LLMs.
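To make the first of these objectives concrete, the following is a minimal sketch of span denoising in the T5 style, not the CodeT5+ implementation itself. The function name `span_denoise`, the whitespace tokenization, the hand-picked spans, and the `<extra_id_N>` sentinel format are all assumptions made for illustration; the abstract does not specify how spans are sampled or tokenized.

```python
def span_denoise(tokens, spans):
    """Build a (source, target) pair by masking the given spans.

    tokens: list of string tokens.
    spans:  list of (start, length) pairs, non-overlapping (hypothetical input;
            a real pipeline would sample these randomly).

    Each masked span is replaced by one sentinel in the source. The target
    lists each sentinel followed by the tokens it hides.
    """
    source, target = [], []
    cursor = 0
    for i, (start, length) in enumerate(sorted(spans)):
        if start < cursor:
            raise ValueError("spans must not overlap")
        sentinel = f"<extra_id_{i}>"  # assumed sentinel naming
        source.extend(tokens[cursor:start])  # keep the unmasked prefix
        source.append(sentinel)              # hide the span in the input
        target.append(sentinel)              # announce the span in the output
        target.extend(tokens[start:start + length])
        cursor = start + length
    source.extend(tokens[cursor:])  # keep whatever follows the last span
    return source, target


tokens = "def add ( a , b ) : return a + b".split()
source, target = span_denoise(tokens, [(1, 1), (8, 2)])
print(" ".join(source))  # def <extra_id_0> ( a , b ) : <extra_id_1> + b
print(" ".join(target))  # <extra_id_0> add <extra_id_1> return a
```

In this sketch the encoder would read `source` and the decoder would be trained to emit `target`, so the model learns to recover masked code spans from their context.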
