Fine-tuning large language models (LLMs) on multi-task instruction-following data has proven to be a powerful learning paradigm for improving their zero-shot capabilities on new tasks. Recent work on generating and selecting high-quality instruction-following data requires substantial human labor to conceive model-understandable instructions for the given tasks and to carefully filter the LLM-generated data. In this work, we introduce INSTRAUG, an automatic instruction augmentation method for multimodal tasks. Starting from only a handful of basic and straightforward meta instructions, it can expand an instruction-following dataset by a factor of 30. Results on two popular multimodal instruction-following benchmarks, MULTIINSTRUCT and InstructBLIP, show that INSTRAUG significantly improves the alignment of multimodal large language models (MLLMs) across 12 multimodal tasks, a gain comparable to that of scaling up the training data several times over.