Multimodal language generation, which leverages the synergy of language and vision, is a rapidly expanding field. However, existing vision-language models struggle with tasks that require complex linguistic understanding. To address this issue, we introduce Visual-Language models as Importance Sampling weights (VLIS), a novel framework that combines the visual conditioning capability of vision-language models with the language understanding of unimodal text-only language models, without further training. VLIS extracts the pointwise mutual information between each image and text from a vision-language model and uses that value as an importance sampling weight to adjust the token likelihood from a text-only model. VLIS improves vision-language models on diverse tasks, including commonsense understanding (WHOOPS, OK-VQA, and ScienceQA) and complex text generation (Concadia, Image Paragraph Captioning, and ROCStories). Our results suggest that VLIS represents a promising new direction for multimodal language generation.
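The reweighting step described above can be illustrated with a minimal sketch. It assumes that, at one decoding step, we already have three next-token log-probability vectors over a shared vocabulary: one from a text-only model, one from a vision-language model conditioned on the image, and one from the same vision-language model without the image. The function name `vlis_style_scores`, the toy vocabulary, and the way the image-free distribution is obtained are illustrative assumptions, not details stated in the abstract.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def vlis_style_scores(text_logp, vlm_logp_with_image, vlm_logp_without_image):
    """Reweight text-only token log-probs by an image-text PMI term.

    All inputs are log-probabilities over the same vocabulary (an assumption
    of this sketch). Returns the renormalized log-probs and the PMI vector.
    """
    # Pointwise mutual information between the image and each candidate token:
    # log p_vl(token | context, image) - log p_vl(token | context)
    pmi = vlm_logp_with_image - vlm_logp_without_image
    # Multiplying by exp(PMI) in probability space is adding PMI in log space.
    adjusted = log_softmax(text_logp + pmi)
    return adjusted, pmi

# Toy example with made-up numbers (not results from the paper).
vocab = ["dog", "cat", "the"]
text_logp = log_softmax([1.0, 1.0, 2.0])      # text-only model prefers "the"
vlm_with = log_softmax([3.0, 0.5, 1.0])       # with the image, "dog" is likely
vlm_without = log_softmax([1.0, 1.0, 1.0])    # without the image, uniform

adjusted, pmi = vlis_style_scores(text_logp, vlm_with, vlm_without)
best_token = vocab[int(np.argmax(adjusted))]
print(best_token)
```

In this toy case the text-only model alone would pick "the", while the PMI weight shifts probability toward the token the image supports. Any scaling or thresholding of the weight is left out because the abstract does not describe it.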