We propose a zero-shot method for Natural Language Inference (NLI) that leverages multimodal representations by grounding language in visual contexts. Our approach generates visual representations of premises using text-to-image models and performs inference by comparing these representations with textual hypotheses. We evaluate two inference techniques: cosine similarity and visual question answering. Our method achieves high accuracy without task-specific fine-tuning, demonstrating robustness against textual biases and surface heuristics. Additionally, we design a controlled adversarial dataset to further validate this robustness. Our findings suggest that leveraging the visual modality as a meaning representation provides a promising direction for robust natural language understanding.
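To make the cosine-similarity variant concrete, the sketch below pairs an off-the-shelf text-to-image model with CLIP embeddings: an image is generated for the premise and the hypothesis is scored against it in the shared image-text embedding space. The specific checkpoints (runwayml/stable-diffusion-v1-5, openai/clip-vit-base-patch32), the helper names, and the label thresholds are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the cosine-similarity inference path, assuming a
# Stable Diffusion text-to-image model (via diffusers) and CLIP embeddings
# (via transformers). Checkpoints, helper names, and thresholds are
# illustrative assumptions rather than the paper's exact setup.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text-to-image model used to ground the premise in a visual scene.
t2i = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"
).to(device)

# CLIP provides a shared image-text embedding space for comparison.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def premise_hypothesis_similarity(premise: str, hypothesis: str) -> float:
    """Generate an image for the premise and score the hypothesis against it."""
    image = t2i(premise).images[0]
    inputs = processor(text=[hypothesis], images=image,
                       return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        out = clip(**inputs)
    # Cosine similarity between the normalized image and text embeddings.
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img_emb @ txt_emb.T).item())

# Hypothetical mapping of similarity scores to NLI labels; the thresholds
# are placeholders and would need to be calibrated in practice.
def predict_label(score: float, t_entail: float = 0.30, t_contra: float = 0.20) -> str:
    if score >= t_entail:
        return "entailment"
    if score <= t_contra:
        return "contradiction"
    return "neutral"
```

The visual question answering variant described in the abstract would replace the cosine-similarity scoring with a VQA model queried about the generated image; the overall premise-to-image grounding step stays the same.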