This paper presents VisLingInstruct, a novel approach to advancing Multi-Modal Language Models (MMLMs) in zero-shot learning. Current MMLMs show impressive zero-shot abilities in multi-modal tasks, but their performance depends heavily on the quality of the instructions they receive. VisLingInstruct addresses this by autonomously evaluating and optimizing instructional texts through In-Context Learning, improving the synergy between visual perception and linguistic expression in MMLMs. Alongside this instructional advancement, we also optimize the visual feature extraction modules in MMLMs, further improving their responsiveness to textual cues. Our comprehensive experiments on MMLMs based on FlanT5 and Vicuna show that VisLingInstruct significantly improves zero-shot performance in visual multi-modal tasks. Notably, it improves accuracy over the prior state of the art by 13.1% on TextVQA and by 9% on HatefulMemes.