This paper presents VisLingInstruct, a novel approach to advancing Multi-Modal Language Models (MMLMs) in zero-shot learning. Current MMLMs show impressive zero-shot abilities in multi-modal tasks, but their performance depends heavily on the quality of instructions. VisLingInstruct tackles this by autonomously evaluating and optimizing instructional texts through In-Context Learning, improving the synergy between visual perception and linguistic expression in MMLMs. Alongside this instructional advancement, we also optimize the visual feature extraction modules in MMLMs, further augmenting their responsiveness to textual cues. Comprehensive experiments on MMLMs built on FlanT5 and Vicuna show that VisLingInstruct significantly improves zero-shot performance in visual multi-modal tasks. Notably, it achieves accuracy gains of 13.1% and 9% over the prior state of the art on the TextVQA and HatefulMemes datasets, respectively.