Vision-language tasks such as VQA, SNLI-VE, and VCR are challenging because they require a model to reason about the semantics of both the visual world and natural language. Supervised methods for vision-language tasks have been well studied, but solving these tasks in a zero-shot setting is less explored. Because Contrastive Language-Image Pre-training (CLIP) has shown remarkable zero-shot performance on image-text matching, previous works exploited this ability by converting vision-language tasks into an image-text matching problem, and they mainly considered global-level matching (e.g., the whole image or sentence). However, we find that fine-grained visual and textual information, e.g., objects in the image and keywords in the sentence, can be fairly informative for semantic understanding. Inspired by this, we propose a unified framework that takes advantage of fine-grained information for zero-shot vision-language learning, covering multiple tasks such as VQA, SNLI-VE, and VCR. Our experiments show that our framework outperforms previous zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR. Furthermore, our ablation studies confirm the effectiveness and generalizability of our proposed method. Code will be available at https://github.com/ThreeSR/UniFine
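To make the idea of casting a vision-language task as image-text matching concrete, here is a minimal sketch of the global-level matching attributed above to previous works. It is not the framework proposed here and does not use fine-grained information. The function names (`cosine`, `pick_answer`), the prompt wording, and the three-dimensional vectors are all hypothetical; the vectors are hand-made stand-ins for the embeddings a CLIP image encoder and text encoder would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pick_answer(image_vec, candidate_vecs):
    """Score each candidate answer against the image and return the best one.

    candidate_vecs maps an answer string to the embedding of a sentence
    built from the question and that answer (hypothetical prompt format).
    """
    scores = {ans: cosine(image_vec, vec) for ans, vec in candidate_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins for CLIP embeddings (assumed values, not real model output).
# Question: "What color is the ball?"  Candidate answers: "red", "blue".
image_vec = [0.9, 0.1, 0.2]            # would come from the image encoder
candidate_vecs = {
    "red": [0.8, 0.2, 0.1],            # e.g. "The color of the ball is red."
    "blue": [0.1, 0.9, 0.3],           # e.g. "The color of the ball is blue."
}

best, scores = pick_answer(image_vec, candidate_vecs)
print(best)
```

In this toy setup the whole image is compared with each whole candidate sentence, which is the global-level matching the abstract contrasts with matching on objects and keywords.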