Instruction tuning finetunes a language model on a collection of instruction-formatted datasets to improve its generalization to unseen tasks. Prior studies have shown that balancing the proportions of different tasks during finetuning is important, but finding the right balance remains challenging. There is currently no systematic method for doing so beyond manual tuning or reliance on practitioners' intuition. In this paper, we introduce SMART (Submodular data Mixture strAtegy for instRuction Tuning), a novel data mixture strategy that uses a submodular function to assign importance scores to tasks; these scores then determine the mixture weights. Given a finetuning budget, SMART redistributes the budget among tasks and selects non-redundant samples from each task. Experimental results demonstrate that SMART significantly outperforms traditional methods such as examples-proportional mixing and equal mixing. Furthermore, SMART facilitates the creation of data mixtures from only a few representative subsets of tasks. Through a task pruning analysis, we show that, in a limited-budget setting, allocating the budget among a subset of representative tasks yields better performance than distributing it among all tasks. The code for reproducing our results is open-sourced at https://github.com/kowndinya-renduchintala/SMART.
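To make the two-stage idea in the abstract concrete (score tasks with a submodular function, turn scores into mixture weights, then pick non-redundant samples within each task's share of the budget), the following is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not name the submodular function or how scores become weights; the sketch assumes a facility-location function over cosine similarities of embeddings, uses greedy marginal gains as importance scores, and normalizes them into weights. All function names and parameters (`greedy_facility_location`, `smart_like_mixture`, `task_emb`, `sample_embs`, `num_tasks`) are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.clip(norms, 1e-12, None)
    return X @ X.T


def greedy_facility_location(sim, k):
    """Greedily maximise f(S) = sum_i max_{j in S} sim[i, j].

    Assumes sim is non-negative. Returns the selected indices (in
    selection order) and the marginal gain of each selection.
    """
    n = sim.shape[0]
    k = min(k, n)
    best = np.zeros(n)  # best[i] = max similarity of item i to the selected set
    selected, gains = [], []
    for _ in range(k):
        # Marginal gain of adding column j to the current selection.
        cand = np.maximum(sim - best[:, None], 0.0).sum(axis=0)
        if selected:
            cand[selected] = -np.inf  # never pick the same item twice
        j = int(np.argmax(cand))
        selected.append(j)
        gains.append(float(cand[j]))
        best = np.maximum(best, sim[:, j])
    return selected, gains


def smart_like_mixture(task_emb, sample_embs, budget, num_tasks):
    """Two-stage mixture sketch (assumed design, see lead-in).

    task_emb:    (T, d) array, one embedding per task.
    sample_embs: list of T arrays, each (n_t, d), one embedding per sample.
    budget:      total number of samples to select.
    num_tasks:   number of representative tasks to keep (task pruning).
    """
    # Stage 1: score tasks by greedy marginal gain and normalise to weights.
    task_sim = np.clip(cosine_similarity_matrix(task_emb), 0.0, None)
    tasks, gains = greedy_facility_location(task_sim, num_tasks)
    weights = np.array(gains) / sum(gains)

    # Stage 2: split the budget by weight, then select non-redundant
    # samples inside each task with the same greedy routine.
    alloc = np.floor(weights * budget).astype(int)
    mixture = {}
    for t, b in zip(tasks, alloc):
        s_sim = np.clip(cosine_similarity_matrix(sample_embs[t]), 0.0, None)
        chosen, _ = greedy_facility_location(s_sim, int(b))
        mixture[t] = chosen
    return mixture, {t: float(w) for t, w in zip(tasks, weights)}
```

In this sketch, setting `num_tasks` below the total number of tasks mimics the task pruning setting described above: the whole budget goes to the selected representative tasks. Rounding down with `np.floor` can leave a small part of the budget unused, and a task never contributes more samples than it contains; a fuller implementation would need a policy for redistributing that remainder.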