We systematically study how three large language models with code capabilities (CodeT5, Codex, and ChatGPT) generalize to out-of-domain data. We consider two fundamental applications: code summarization and code generation. We split data into domains along its natural boundaries: by organization, by project, and by module within a software project. We establish that samples from each new domain pose a significant distribution-shift challenge for all three models. We then study how established methods can adapt models to generalize better to new domains. Our experiments show that while multitask learning alone is a reasonable baseline, combining it with few-shot finetuning on examples retrieved from the training data can achieve very strong performance. Moreover, this combination can outperform direct finetuning in very low-data scenarios. Finally, we consider variations of this approach to create a more broadly applicable method that adapts to multiple domains at once. We find that, for code generation, a model adapted to multiple domains simultaneously performs on par with models adapted to a single domain.
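As a rough illustration of the retrieval step mentioned above, the following sketch selects the training examples most similar to a handful of target-domain examples, so that they could serve as a support set for few-shot finetuning. It is only a sketch under stated assumptions: the abstract does not specify the similarity measure, so bag-of-words cosine similarity is used here as a stand-in, and the names `tokenize`, `cosine`, and `retrieve_support_set` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Crude whitespace tokenizer (an assumption; a real system would
    # likely use the model's own tokenizer or learned embeddings).
    return text.lower().split()


def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_support_set(target_examples, train_pool, k=2):
    """Return up to k training examples closest to any target-domain example."""
    target_bags = [Counter(tokenize(t)) for t in target_examples]
    scored = []
    for idx, example in enumerate(train_pool):
        bag = Counter(tokenize(example))
        # Score each training example by its best match in the target domain.
        score = max((cosine(bag, tb) for tb in target_bags), default=0.0)
        scored.append((score, idx))
    # Highest similarity first; ties broken by original position.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [train_pool[i] for _, i in scored[:k]]


if __name__ == "__main__":
    pool = [
        "def read_file ( path )",
        "def parse json ( text )",
        "class http client",
        "def write_file ( path , data )",
    ]
    print(retrieve_support_set(["def open_file ( path )"], pool, k=2))
```

The retrieved examples would then be passed to whatever few-shot finetuning procedure is in use; that step is not shown here, since its details are not given in this abstract.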