Natural Language Inference (NLI) is a core task in natural language processing: given two sentences, typically called the premise and the hypothesis, determine the relationship between them. Traditional NLI models rely solely on the semantic information in the sentences themselves and have no access to relevant situational visual information. Because language is often ambiguous and vague, this can prevent a complete understanding of what the sentences mean. To address this challenge, we propose ScenaFuse, an adapter that integrates large-scale pre-trained linguistic knowledge with relevant visual information for NLI tasks. Specifically, we first design an image-sentence interaction module that incorporates visual information into the attention mechanism of the pre-trained model, allowing the two modalities to interact comprehensively. We then introduce an image-sentence fusion module that adaptively integrates visual information from images with semantic information from sentences. By combining relevant visual information with linguistic knowledge, our approach bridges the gap between language and vision and improves understanding and inference in NLI tasks. Extensive benchmark experiments demonstrate that ScenaFuse, a scenario-guided approach, consistently boosts NLI performance.
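The abstract names two components, an image-sentence interaction module and an adaptive image-sentence fusion module, without giving their equations. The following is a minimal sketch of one plausible reading, not the ScenaFuse implementation: it assumes the interaction step is scaled dot-product cross-attention from sentence tokens to image regions, and that the adaptive fusion is a scalar sigmoid gate. All function names, parameters, and toy vectors (`cross_attend`, `gated_fuse`, `gate_w_text`, `gate_w_visual`) are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(token_vecs, region_vecs):
    """Assumed interaction step: each sentence token attends over image regions.

    Uses the region vectors as both keys and values (no learned projections),
    which keeps the sketch small; a real adapter would learn these projections.
    """
    dim = len(region_vecs[0])
    scale = math.sqrt(dim)
    attended = []
    for query in token_vecs:
        weights = softmax([dot(query, key) / scale for key in region_vecs])
        attended.append(
            [sum(w * region[d] for w, region in zip(weights, region_vecs))
             for d in range(dim)]
        )
    return attended


def gated_fuse(text_vec, visual_vec, gate_w_text, gate_w_visual, gate_bias=0.0):
    """Assumed fusion step: a scalar gate g in (0, 1) mixes text and visual vectors."""
    logit = dot(gate_w_text, text_vec) + dot(gate_w_visual, visual_vec) + gate_bias
    g = 1.0 / (1.0 + math.exp(-logit))
    fused = [g * t + (1.0 - g) * v for t, v in zip(text_vec, visual_vec)]
    return fused, g


# Toy inputs: two sentence tokens and three image regions, all 2-dimensional.
tokens = [[1.0, 0.0], [0.0, 1.0]]
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

visual = cross_attend(tokens, regions)
# Zero gate weights give g = 0.5, i.e. an even mix of text and visual vectors.
fused, g = gated_fuse(tokens[0], visual[0], [0.0, 0.0], [0.0, 0.0])
```

With learned gate weights, `g` would vary per example, which is one way to read "adaptively": the model can lean on the sentence when the image is uninformative and on the image when the sentence is ambiguous.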