Pre-trained models of source code have been widely adopted for code intelligence tasks. Recently, as model and corpus sizes have scaled up, large language models have shown the ability to perform in-context learning (ICL). ICL combines a task instruction with a few examples into a demonstration, and then feeds the demonstration to the language model to make predictions. This learning paradigm is training-free and has shown impressive performance in various natural language processing and code intelligence tasks. However, the performance of ICL relies heavily on the quality of the demonstration, e.g., on which examples are selected. It is therefore important to systematically investigate how to construct a good demonstration for code-related tasks. In this paper, we empirically explore the impact of three key factors on the performance of ICL in code intelligence tasks: the selection, order, and number of demonstration examples. We conduct extensive experiments on three code intelligence tasks: code summarization, bug fixing, and program synthesis. Our experimental results demonstrate that all three factors dramatically impact the performance of ICL in code intelligence tasks. Additionally, we summarize our findings and provide takeaway suggestions on how to construct effective demonstrations with respect to these three factors. We also show that a carefully designed demonstration based on our findings can lead to substantial improvements over widely used demonstration construction methods, e.g., improving BLEU-4, EM, and EM by at least 9.90%, 175.96%, and 50.81% on code summarization, bug fixing, and program synthesis, respectively.
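To make the three factors concrete, the following is a minimal sketch of how a demonstration for code summarization might be assembled. It is an illustration under stated assumptions, not the construction method evaluated in the paper: the function names (`similarity`, `build_demonstration`), the prompt layout, the token-overlap similarity used for selection, and the choice to place the most similar example closest to the query are all hypothetical.

```python
import math
from collections import Counter


def similarity(a: str, b: str) -> float:
    """Cosine similarity over whitespace-token counts (an assumed, simple metric)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def build_demonstration(instruction, pool, query, k=2):
    """Assemble an ICL prompt from (code, summary) pairs in `pool`.

    Selection: rank pool examples by similarity to the query (assumption).
    Number:    keep only the top `k` examples.
    Order:     put the most similar example last, next to the query (assumption).
    """
    ranked = sorted(pool, key=lambda ex: similarity(ex[0], query), reverse=True)[:k]
    ranked.reverse()  # least similar first, most similar last

    parts = [instruction]
    for code, summary in ranked:
        parts.append(f"Code: {code}\nSummary: {summary}")
    # The query is appended with an empty answer slot for the model to complete.
    parts.append(f"Code: {query}\nSummary:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    pool = [
        ("def add(a, b): return a + b", "Add two numbers."),
        ("def read_file(path): return open(path).read()", "Read a file."),
        ("def sub(a, b): return a - b", "Subtract two numbers."),
    ]
    prompt = build_demonstration(
        "Summarize the following code.", pool, "def mul(a, b): return a * b", k=2
    )
    print(prompt)
```

Changing `k`, the ranking function, or the final ordering step changes the number, selection, and order of examples independently, which is the kind of variation the experiments compare.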