Despite the success of existing instruction-tuned models, we find that they usually struggle to respond to queries containing multiple instructions. This impairs their performance on complex problems whose solutions consist of multiple intermediate tasks. We therefore contend that part of the fine-tuning data mixture should be sequential, containing a chain of interrelated tasks. We first approach sequential instruction tuning from a task-driven perspective, manually creating interpretable intermediate tasks for multilingual and visual question answering: namely "translate then predict" and "caption then answer". Next, we automate this process by turning the instructions in existing datasets (e.g., Alpaca and FlanCoT) into diverse and complex sequential instructions, making our method general-purpose. Models fine-tuned with our sequential instructions show improved results in coding, mathematics, and open-ended generation. Moreover, we put forward a new benchmark, SeqEval, to evaluate a model's ability to follow all the instructions in a sequence, which further corroborates the benefits of our fine-tuning method. We hope that our work opens new research avenues for instruction tuning on complex tasks.