This paper presents VisLingInstruct, a novel approach to advancing Multi-Modal Language Models (MMLMs) in zero-shot learning. Current MMLMs show impressive zero-shot abilities in multi-modal tasks, but their performance depends heavily on the quality of instructions. VisLingInstruct tackles this by autonomously evaluating and optimizing instructional texts through In-Context Learning, improving the synergy between visual perception and linguistic expression in MMLMs. Alongside this instructional advancement, we also optimize the visual feature extraction modules in MMLMs, further improving their responsiveness to textual content. Our comprehensive experiments on MMLMs based on FlanT5 and Vicuna show that VisLingInstruct significantly improves zero-shot performance on visual multi-modal tasks. Notably, it achieves 13.1% and 9% increases in accuracy over the prior state of the art on the TextVQA and HatefulMemes datasets, respectively. Our main code is available at https://github.com/Zhudongsheng75/VisLingInstruct.