Humans possess a remarkable ability to integrate auditory and visual information, enabling a deeper understanding of the surrounding environment. This early fusion of audio and visual cues, documented in cognitive psychology and neuroscience research, suggests a promising direction for multimodal perception models. However, training early fusion architectures is challenging, because their increased expressivity requires a robust learning framework to exploit. In this paper, we address this challenge by leveraging the masked reconstruction framework, previously successful in unimodal settings, to train audio-visual encoders with early fusion. Additionally, we propose an attention-based fusion module that captures fine-grained interactions between local audio and visual representations. While effective, this procedure can become computationally intractable as the number of local representations increases. To reduce this cost, we propose an alternative procedure that factorizes the local representations before modeling audio-visual interactions. Extensive evaluations on a variety of datasets demonstrate the superiority of our approach in audio-event classification, visual sound localization, sound separation, and audio-visual segmentation. These contributions enable the efficient training of deeply integrated audio-visual models and significantly advance the usefulness of early fusion architectures.
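To make the cost argument concrete, the following is a minimal sketch that compares dense attention between local audio and visual tokens with one possible factorized alternative. It is not the procedure proposed in this paper, whose details the abstract does not give. The pooling-by-queries factorization, the function names (`attend`, `dense_fusion`, `factorized_fusion`), the token counts, and the number of summary tokens `k` are all assumptions chosen for illustration.

```python
import math
import random


def attend(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors.

    Returns the attended outputs and the number of pairwise scores computed.
    """
    scale = math.sqrt(len(queries[0]))
    outputs, n_scores = [], 0
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        n_scores += len(scores)
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))
        ])
    return outputs, n_scores


def dense_fusion(audio, visual):
    """Every audio token attends to every visual token: N_a * N_v scores."""
    return attend(audio, visual, visual)


def factorized_fusion(audio, visual, audio_queries, visual_queries):
    """Hypothetical factorization (an assumption, not the paper's method):
    pool each modality into k summary tokens, then model interactions
    between the summaries only."""
    audio_summary, n_a = attend(audio_queries, audio, audio)       # k * N_a
    visual_summary, n_v = attend(visual_queries, visual, visual)   # k * N_v
    fused, n_f = attend(audio_summary, visual_summary, visual_summary)  # k * k
    return fused, n_a + n_v + n_f


def random_tokens(rng, count, dim):
    """Stand-in for learned token embeddings."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
n_audio, n_visual, dim, k = 64, 196, 32, 8  # illustrative sizes only
audio = random_tokens(rng, n_audio, dim)
visual = random_tokens(rng, n_visual, dim)
audio_queries = random_tokens(rng, k, dim)
visual_queries = random_tokens(rng, k, dim)

dense_out, dense_cost = dense_fusion(audio, visual)
fact_out, fact_cost = factorized_fusion(audio, visual, audio_queries, visual_queries)
print("dense scores:", dense_cost)
print("factorized scores:", fact_cost)
```

With these illustrative sizes, the dense variant computes 64 × 196 = 12,544 pairwise scores, while the factorized variant computes 8 × 64 + 8 × 196 + 8 × 8 = 2,144. The dense count grows with the product of the two token counts, whereas the factorized count grows with their sum for a fixed `k`, at the price of routing all cross-modal interactions through the summary tokens.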