Recent work shows that a model pre-trained on large-scale source code data can gain appreciable generalization capability, yet it still requires a sizeable amount of target-task data for fine-tuning. How well the model generalizes depends heavily on the size and quality of the fine-tuning data, which is a serious obstacle for target tasks with limited or unavailable resources. Cross-task generalization, which aims to improve a model's generalization to tasks unseen during training, is therefore of strong research and application value. In this paper, we propose a large-scale benchmark that includes 216 existing code-related tasks. We annotate each task with meta information such as a task description and an instruction, which contains detailed information about the task and a solution guide. This annotation also lets us easily create a wide variety of ``training/evaluation'' task splits to evaluate different cross-task generalization capabilities of a model. Preliminary experiments demonstrate that cross-task generalization can be largely improved by in-context learning methods such as few-shot learning and learning from task instructions, which shows the promise of conducting cross-task learning research on our benchmark. We hope that the collected datasets and our benchmark will facilitate future work, including work beyond cross-task generalization.
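As an illustration of how task-level meta information could support ``training/evaluation'' splits and in-context prompts, the following is a minimal sketch and not the benchmark's actual format or tooling. The `Task` record, its field names, the category labels, the toy tasks, and the prompt layout are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical record for one task and its meta information."""
    name: str
    category: str      # assumed grouping label, e.g. "summarization"
    description: str   # what the task is about
    instruction: str   # guidance on how to solve the task


def split_by_category(tasks, held_out_categories):
    """Return (train, evaluation) so that held-out categories never appear in training."""
    held_out = set(held_out_categories)
    train = [t for t in tasks if t.category not in held_out]
    evaluation = [t for t in tasks if t.category in held_out]
    return train, evaluation


def build_prompt(task, demonstrations, query):
    """Compose an in-context prompt: instruction, few-shot pairs, then the query."""
    parts = [f"Instruction: {task.instruction}"]
    for source, target in demonstrations:
        parts.append(f"Input: {source}\nOutput: {target}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


# Toy tasks invented for illustration only.
TASKS = [
    Task("toy_summarize", "summarization",
         "Summarize a function.", "Write a one-line summary of the code."),
    Task("toy_translate", "translation",
         "Translate code between languages.", "Rewrite the Java snippet in Python."),
    Task("toy_defect", "defect detection",
         "Flag buggy code.", "Answer yes if the code contains a bug, otherwise no."),
]

if __name__ == "__main__":
    train_tasks, eval_tasks = split_by_category(TASKS, ["translation"])
    demo = [("int x = 1;", "x = 1")]
    print(build_prompt(eval_tasks[0], demo, "int y = 2;"))
```

Holding out a whole category is only one possible split. Because the split is computed from the metadata, other groupings (for example by programming language, if such a field were annotated) would follow the same pattern.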