We systematically study the capacity of two large language models for code, CodeT5 and Codex, to generalize to out-of-domain data. We consider two fundamental applications: code summarization and code generation. We split data into domains along its natural boundaries: by organization, by project, and by module within a software project. This makes it trivial to recognize in-domain versus out-of-domain data at deployment time. We establish that samples from each new domain pose a significant distribution-shift challenge for both models. We then study how well different established methods adapt the models to generalize better to new domains. Our experiments show that while multitask learning alone is a reasonable baseline, combining it with few-shot finetuning on examples retrieved from the training data can achieve very strong performance. In fact, according to our experiments, this solution can outperform direct finetuning in very low-data scenarios. Finally, we consider variations of this approach to create a more broadly applicable method that adapts to multiple domains at once. We find that, for code generation, a model adapted to multiple domains simultaneously performs on par with models adapted to each domain individually.
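To illustrate the idea of splitting data along natural boundaries, here is a minimal sketch rather than the study's actual pipeline. It assumes each sample is identified by a path of the hypothetical form `org/project/.../file`, and that the directory containing a file stands in for its module. The helper names `domain_key` and `split_by_domain` are invented for this example.

```python
from pathlib import PurePosixPath


def domain_key(path: str, level: str) -> str:
    """Return the domain of a file at the requested granularity.

    Assumes a hypothetical "org/project/.../file" path layout.
    """
    parts = PurePosixPath(path).parts
    if len(parts) < 3:
        raise ValueError(f"expected org/project/.../file, got {path!r}")
    if level == "org":
        return parts[0]
    if level == "project":
        return "/".join(parts[:2])
    if level == "module":
        # Assumption: the directory holding the file is its module.
        return "/".join(parts[:-1])
    raise ValueError(f"unknown level: {level!r}")


def split_by_domain(paths, level, held_out):
    """Partition paths into (in_domain, out_of_domain) lists.

    A path is out-of-domain when its domain key is in `held_out`.
    """
    in_domain, out_of_domain = [], []
    for p in paths:
        target = out_of_domain if domain_key(p, level) in held_out else in_domain
        target.append(p)
    return in_domain, out_of_domain


if __name__ == "__main__":
    # Illustrative paths only; not taken from the study's data.
    paths = [
        "apache/kafka/clients/Producer.java",
        "apache/spark/sql/Row.java",
        "google/guava/collect/Lists.java",
    ]
    train, test = split_by_domain(paths, "org", {"google"})
    print(train)
    print(test)
```

Because the domain key is derived from the path alone, the same function can decide at deployment time whether a new sample belongs to a domain seen in training, which is the property that makes in-domain versus out-of-domain recognition trivial.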