Recent advances show that a model pre-trained on large-scale source code data can gain appreciable generalization capability, yet it still requires a sizeable amount of data on the target task for fine-tuning. Moreover, the effectiveness of this generalization depends heavily on the size and quality of the fine-tuning data, which is detrimental for target tasks with limited or unavailable resources. Cross-task generalization, which aims to improve a model's generalization to tasks unseen during training, is therefore of strong research and application value. In this paper, we propose a large-scale benchmark that includes 216 existing code-related tasks. We annotate each task with meta information such as a task description and an instruction, which contains detailed information about the task and a solution guide. This annotation also lets us easily create a wide variety of ``training/evaluation'' task splits to evaluate different cross-task generalization capabilities of a model. We then perform preliminary experiments demonstrating that cross-task generalization can be largely improved by in-context learning methods such as few-shot learning and learning from task instructions, which shows the promise of conducting cross-task learning research on our benchmark. We hope that the collection of datasets and our benchmark will facilitate future work, not limited to cross-task generalization.