Video multimodal fusion aims to integrate the multimodal signals in a video, such as visual, audio, and text, so that predictions can draw on complementary content from multiple modalities. However, unlike image-text multimodal tasks, video involves longer multimodal sequences with more redundancy and noise in both the visual and audio modalities. Prior denoising methods, such as the forget gate, filter noise at a coarse granularity, so they often suppress redundant and noisy information at the risk of losing critical information. We therefore propose a denoising bottleneck fusion (DBF) model for fine-grained video multimodal fusion. On the one hand, we employ a bottleneck mechanism with a restrained receptive field to filter out noise and redundancy. On the other hand, we use a mutual information maximization module to regulate the filtering module so that it preserves key information within the different modalities. Our DBF model achieves significant improvement over current state-of-the-art baselines on multiple benchmarks covering multimodal sentiment analysis and multimodal summarization tasks. These results demonstrate that our model can effectively capture salient features from noisy and redundant video, audio, and text inputs. The code for this paper is publicly available at https://github.com/WSXRHFG/DBF.
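To make the idea of a bottleneck with a restrained receptive field more concrete, the following is a minimal sketch, not the authors' implementation (for that, see the repository linked above). It assumes that a small, fixed number of bottleneck tokens gather information from each modality through plain scaled dot-product attention, so that long and noisy sequences are compressed into a few fused vectors. All names (`attend`, `bottleneck_fuse`, `NUM_BOTTLENECK`), the sequence lengths, and the random features are hypothetical, and the sketch omits the learned projections and the mutual information maximization module.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Scaled dot-product attention where keys and values are the same vectors.

    Assumption: no learned projections, to keep the sketch dependency-free.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, keys_values)) for d in range(dim)]
        )
    return outputs


def bottleneck_fuse(bottleneck, modalities):
    """Update the bottleneck tokens by attending to each modality, then averaging.

    Cross-modal information can only flow through these few tokens,
    which is what restricts the receptive field in this sketch.
    """
    updates = [attend(bottleneck, seq) for seq in modalities.values()]
    dim = len(bottleneck[0])
    return [
        [sum(u[i][d] for u in updates) / len(updates) for d in range(dim)]
        for i in range(len(bottleneck))
    ]


def random_sequence(rng, length, dim):
    """Stand-in for extracted features (hypothetical data)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(length)]


rng = random.Random(0)
DIM, NUM_BOTTLENECK = 8, 4  # hypothetical sizes

modalities = {
    "visual": random_sequence(rng, 50, DIM),  # long, redundant sequence
    "audio": random_sequence(rng, 40, DIM),
    "text": random_sequence(rng, 12, DIM),
}
bottleneck = random_sequence(rng, NUM_BOTTLENECK, DIM)

fused = bottleneck_fuse(bottleneck, modalities)
print(len(fused), len(fused[0]))  # number of fused tokens and their dimension
```

Whatever the input lengths, the fused representation always has `NUM_BOTTLENECK` tokens. That compression is what the bottleneck relies on to drop redundancy, and it is also why a complementary objective is needed to keep key information from being discarded along with the noise.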