Recently, multi-task instruction tuning has been applied to sentence representation learning. It endows models with the capability of generating task-specific representations under the guidance of task instructions, and exhibits strong generalization to new tasks. However, these methods mostly neglect potential interference across different tasks and instances, which may hinder the training and convergence of the model. To address this, we propose a data curriculum method, namely Data-CUBE, that arranges the order of all the multi-task training data to minimize interference risk from two views. At the task level, we aim to find the task order that minimizes the total cross-task interference risk. This is exactly the traveling salesman problem, so we use a simulated annealing algorithm to find a solution. At the instance level, we measure the difficulty of all instances in each task, then divide them into easy-to-difficult mini-batches for training. Experiments on MTEB sentence representation evaluation tasks show that our approach can boost the performance of state-of-the-art methods. Our code and data are publicly available at \url{https://github.com/RUCAIBox/Data-CUBE}.
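The two levels described in the abstract can be illustrated with a minimal sketch. It is not the released Data-CUBE implementation. It assumes a hypothetical precomputed pairwise interference matrix `risk`, treats the task order as an open path rather than a closed tour, and takes a hypothetical per-instance `difficulty` score as given. The function names, the segment-reversal move, and the geometric cooling schedule are all illustrative choices, and the annealer expects at least two tasks.

```python
import math
import random


def order_cost(order, risk):
    """Sum of pairwise interference risk between consecutive tasks (open path)."""
    return sum(risk[a][b] for a, b in zip(order, order[1:]))


def anneal_task_order(risk, steps=20000, t_start=1.0, t_end=1e-3, seed=0):
    """Simulated annealing over task permutations using random segment reversals.

    `risk` is a hypothetical n x n matrix (n >= 2) of cross-task interference scores.
    """
    rng = random.Random(seed)
    n = len(risk)
    order = list(range(n))
    rng.shuffle(order)
    cost = order_cost(order, risk)
    best, best_cost = order[:], cost
    for step in range(steps):
        # Geometric cooling from t_start down to t_end (an assumed schedule).
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        i, j = sorted(rng.sample(range(n), 2))
        # Propose a neighbor by reversing the segment between positions i and j.
        candidate = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
        cand_cost = order_cost(candidate, risk)
        delta = cand_cost - cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            order, cost = candidate, cand_cost
            if cost < best_cost:
                best, best_cost = order[:], cost
    return best, best_cost


def easy_to_difficult_batches(instances, difficulty, batch_size):
    """Sort one task's instances by ascending difficulty, then chunk into mini-batches."""
    ranked = [x for _, x in sorted(zip(difficulty, instances), key=lambda p: p[0])]
    return [ranked[k:k + batch_size] for k in range(0, len(ranked), batch_size)]
```

For example, `easy_to_difficult_batches(["a", "b", "c", "d", "e"], [0.9, 0.1, 0.5, 0.3, 0.7], 2)` returns `[["b", "d"], ["c", "e"], ["a"]]`. Simulated annealing is a heuristic, so the order returned by `anneal_task_order` is a low-cost candidate rather than a guaranteed optimum.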