Existing work on weakly-supervised audio-visual video parsing adopts the hybrid attention network (HAN) as the multi-modal embedding to capture cross-modal context. HAN embeds the audio and visual modalities with a shared network in which cross-attention is performed at the input. However, such early fusion heavily entangles two modalities that are not fully correlated, which leads to sub-optimal performance in detecting single-modality events. To address this problem, we propose the messenger-guided mid-fusion transformer, which reduces the uncorrelated cross-modal context in the fusion. The messengers condense the full cross-modal context into a compact representation that preserves only the useful cross-modal information. Furthermore, because microphones capture audio events from all directions while cameras record visual events only within a restricted field of view, unaligned cross-modal context from audio occurs more frequently in visual event prediction. We therefore propose cross-audio prediction consistency to suppress the impact of irrelevant audio information on visual event prediction. Experiments consistently show that our framework outperforms existing state-of-the-art methods.
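The idea of condensing cross-modal context through messengers can be illustrated with a small sketch. This is not the architecture from the paper: it assumes single-head dot-product attention without learned projections, and the names `attend`, `messenger_fusion`, and the sizes `T`, `M`, `d` are hypothetical choices made only for illustration. It shows one plausible reading of the abstract, in which a few messenger tokens first summarize the source modality and the target modality then attends to those summaries instead of the full source sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values):
    # Single-head scaled dot-product attention with no learned
    # projections (an illustrative simplification).
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

def messenger_fusion(target, source, messengers):
    # Step 1: the messengers condense the source modality into M tokens.
    condensed = attend(messengers, source)            # shape (M, d)
    # Step 2: target tokens attend over themselves plus the condensed
    # messengers, never over the raw source tokens.
    context = np.concatenate([target, condensed], axis=0)
    return attend(target, context)                    # shape (T, d)

# Hypothetical sizes: T segments per modality, M messengers, d features.
rng = np.random.default_rng(0)
T, M, d = 10, 2, 16
audio = rng.normal(size=(T, d))
visual = rng.normal(size=(T, d))
messengers = rng.normal(size=(M, d))

fused_visual = messenger_fusion(visual, audio, messengers)
print(fused_visual.shape)  # (10, 16)
```

Because the visual tokens see only `M` condensed tokens rather than all `T` audio tokens, the amount of audio context that can leak into the visual stream is bounded by the messenger bottleneck, which is the intuition the abstract describes.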