The goal of Multilingual Visual Answer Localization (MVAL) is to locate the video segment that answers a given multilingual question. Existing methods either rely solely on the visual modality or combine the visual and subtitle modalities. However, these methods neglect the audio modality in videos, which leaves the input information incomplete and degrades performance on the MVAL task. In this paper, we propose a unified Audio-Visual-Textual Span Localization (AVTSL) method that incorporates the audio modality to augment both visual and textual representations for the MVAL task. Specifically, we integrate features from the three modalities and develop three predictors, each tailored to the unique contributions of the fused modalities: an audio-visual predictor, a visual predictor, and a textual predictor. Each predictor generates predictions based on its respective modality. To keep the predicted results consistent, we introduce an Audio-Visual-Textual Consistency module. This module uses a Dynamic Triangular Loss (DTL) function that allows each modality's predictor to learn dynamically from the others. This collaborative learning encourages the model to generate consistent and comprehensive answers. Extensive experiments show that our proposed method outperforms several state-of-the-art (SOTA) methods, which demonstrates the effectiveness of incorporating the audio modality.
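The abstract names a Dynamic Triangular Loss but does not give its formula, so the following is only an illustrative sketch of the general idea of three predictors learning from one another, not the authors' definition. It assumes each predictor outputs logits over candidate span-start positions, uses a pairwise KL divergence as the consistency term, and weights each "teacher" predictor by how well it currently fits the ground truth. The function names, the weighting scheme, and the choice of KL divergence are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_entropy(p, target_index, eps=1e-12):
    """Negative log-likelihood of the ground-truth position."""
    return -math.log(p[target_index] + eps)


def triangular_consistency_loss(logits_by_predictor, target_index):
    """Hypothetical three-way consistency loss (an assumption, not the paper's DTL).

    logits_by_predictor maps a predictor name to its logits over positions.
    Returns (total_loss, teacher_weight).
    """
    names = list(logits_by_predictor)
    probs = {n: softmax(logits_by_predictor[n]) for n in names}

    # Supervised term: each predictor is trained against the ground truth.
    supervised = {n: cross_entropy(probs[n], target_index) for n in names}

    # "Dynamic" part (assumed): predictors with lower supervised loss
    # receive a larger weight when acting as a teacher for the others.
    weights = softmax([-supervised[n] for n in names])
    teacher_weight = dict(zip(names, weights))

    # "Triangular" part (assumed): every ordered pair of distinct
    # predictors contributes a KL term pulling the student to the teacher.
    consistency = 0.0
    for student in names:
        for teacher in names:
            if teacher != student:
                consistency += teacher_weight[teacher] * kl_divergence(
                    probs[teacher], probs[student]
                )

    return sum(supervised.values()) + consistency, teacher_weight
```

In this sketch the consistency term vanishes when the three predictors agree and grows as they diverge, with the currently most accurate predictor exerting the strongest pull. In a real training setup the teacher distributions would typically be detached from the gradient graph; that detail is omitted here because the example does not use an autodiff framework.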