Audio-visual video parsing is the task of categorizing a video at the segment level using only weak labels and identifying each detected event as audible or visible. Recent methods for this task use attention mechanisms to capture semantic correlations across the whole video and between the audio and visual modalities. However, these approaches overlook the importance of individual segments within a video and the relationships among them, and they tend to rely on a single modality when learning features. In this paper, we propose a novel interactive-enhanced cross-modal perception method (CM-PIE), which learns fine-grained features by applying a segment-based attention module. Furthermore, we introduce a cross-modal aggregation block that jointly optimizes the semantic representations of the audio and visual signals by enhancing inter-modal interactions. Experimental results show that our model achieves better parsing performance than other methods on the Look, Listen, and Parse dataset.
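
To make the idea of segment-level attention with inter-modal interaction more concrete, the following is a minimal sketch in plain Python. It is not the CM-PIE architecture, whose details are not given here. It only illustrates the general pattern of letting each segment attend to segments of its own modality and of the other modality, then combining the results. The function names (`attend`, `cross_modal_block`), the residual-sum aggregation, the segment count `T`, and the feature size `D` are all assumptions chosen for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Scaled dot-product attention without learned projections.

    Each query segment produces a weighted average of the context segments,
    with weights given by the softmax of scaled dot products.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ci for qi, ci in zip(q, c)) / math.sqrt(d) for c in context]
        weights = softmax(scores)
        out.append([sum(w * c[j] for w, c in zip(weights, context)) for j in range(d)])
    return out


def add_rows(*matrices):
    """Element-wise sum of equally shaped list-of-lists matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*matrices)]


def cross_modal_block(audio, visual):
    """Hypothetical aggregation: input + intra-modal + inter-modal attention."""
    audio_out = add_rows(audio, attend(audio, audio), attend(audio, visual))
    visual_out = add_rows(visual, attend(visual, visual), attend(visual, audio))
    return audio_out, visual_out


# Illustrative sizes only: T segments per video, D features per segment.
random.seed(0)
T, D = 10, 8
audio = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]
visual = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]

audio_out, visual_out = cross_modal_block(audio, visual)
print(len(audio_out), len(audio_out[0]))
```

In a trained model the queries, keys, and values would pass through learned projections, and the segment-level outputs would feed a classifier trained with the weak labels; both are omitted here to keep the sketch short.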