Current evaluation and training approaches for multimodal language models (MLMs) overlook the influence of instruction format, presenting an elephant-in-the-room problem. Previous work addresses this problem with manually crafted instructions, which fail to yield significant insights due to their limited diversity and scalability. In this work, we propose a programmatic instruction template generator capable of producing over 39B unique template combinations by filling randomly sampled positional synonyms into weight-sampled meta templates, enabling us to comprehensively examine MLM performance across diverse instruction templates. Our experiments with eight common MLMs on five benchmark datasets reveal that MLMs are highly sensitive to templates, with performance gaps of up to 29% between different templates. We further augment the instruction-tuning dataset of LLaVA-1.5 with our template generator and instruction-tune LLaVA-1.5-7B and LLaVA-1.5-13B. Models tuned on our augmented dataset achieve the best overall performance when compared with same-scale MLMs tuned on datasets up to 75 times larger, highlighting the importance of instruction templates in MLM training. The code is available at https://github.com/shijian2001/TemplateMatters.
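The generation scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the meta templates, their weights, and the synonym sets below are all hypothetical placeholders, but the mechanism matches the description — a meta template is drawn by weighted sampling, and each positional slot is then filled with a uniformly sampled synonym.

```python
import random

# Illustrative meta templates with sampling weights (hypothetical examples,
# not from the paper). Each {slot} is a positional synonym slot; {{q}} marks
# where the actual question text is later inserted.
META_TEMPLATES = [
    ("{answer_verb} the {question_noun} based on the {image_noun}: {{q}}", 3),
    ("Given the {image_noun}, {answer_verb} the following {question_noun}: {{q}}", 2),
    ("{{q}} {answer_verb} using the {image_noun}.", 1),
]

# Illustrative synonym sets for each positional slot.
SYNONYMS = {
    "answer_verb": ["Answer", "Respond to", "Address"],
    "question_noun": ["question", "query", "prompt"],
    "image_noun": ["image", "picture", "photo"],
}

def generate_template(rng: random.Random) -> str:
    """Sample one meta template by weight, then fill every slot with a
    randomly chosen synonym; the question placeholder is kept for later use."""
    templates, weights = zip(*META_TEMPLATES)
    meta = rng.choices(templates, weights=weights, k=1)[0]
    for slot, options in SYNONYMS.items():
        meta = meta.replace("{" + slot + "}", rng.choice(options))
    return meta.replace("{{q}}", "{question}")

if __name__ == "__main__":
    rng = random.Random(0)
    print(generate_template(rng))
```

The combinatorics are what drive the scale: with T weighted meta templates, each containing k slots drawn from synonym sets of size s, the space grows roughly as T · s^k, which is how a modest inventory of templates and synonym lists can yield billions of unique instruction formats.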