Vision-language models have been explored across a wide range of tasks and achieve satisfactory performance. However, it remains under-explored how to consolidate entity understanding from a varying number of images and align it with pre-trained language models for generative tasks. In this paper, we propose MIVC, a general multiple instance visual component that bridges the gap between a varying number of input images and off-the-shelf vision-language models by aggregating visual representations in a permutation-invariant fashion through a neural network. We show that MIVC can be plugged into vision-language models to consistently improve performance on visual question answering, classification, and captioning tasks on a publicly available e-commerce dataset with multiple images per product. Furthermore, we show that the component provides insight into the contribution of each image to the downstream tasks.
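To make the idea of permutation-invariant aggregation concrete, the following is a minimal sketch of one common option, attention-based pooling over a set of per-image embeddings. The abstract does not specify MIVC's actual architecture, so the function name `attention_pool`, the parameters `v` and `w`, and the choice of attention pooling itself are assumptions made for illustration only. The per-image weights it returns are one way such a component could indicate how much each image contributes.

```python
import math
import random


def attention_pool(embeddings, w, v):
    """Aggregate a variable-size set of image embeddings into one vector.

    embeddings: list of n vectors of length d, one per image
    v: hidden x d projection matrix as a list of rows (hypothetical parameter)
    w: scoring vector of length hidden (hypothetical parameter)
    Returns (pooled vector of length d, attention weights of length n).
    """
    # Score each image independently, so the scores do not depend on order.
    scores = []
    for x in embeddings:
        hidden = [math.tanh(sum(vij * xj for vij, xj in zip(row, x))) for row in v]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))

    # Softmax over the set (shifted by the max for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of embeddings: a symmetric function of the input set.
    d = len(embeddings[0])
    pooled = [sum(a * x[j] for a, x in zip(weights, embeddings)) for j in range(d)]
    return pooled, weights


# Toy data: 5 "images" with 4-dimensional embeddings and random parameters.
random.seed(0)
d, hidden = 4, 3
v = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(hidden)]
w = [random.uniform(-1, 1) for _ in range(hidden)]
images = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(5)]

pooled, weights = attention_pool(images, w, v)
# Reordering the images should leave the pooled vector unchanged
# (up to floating-point rounding).
pooled_shuffled, _ = attention_pool(images[::-1], w, v)
```

In this sketch the pooled vector has a fixed size regardless of how many images are supplied, which is what would let it feed a downstream model that expects a single visual input.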