Multimodal Class-aware Semantic Enhancement Network for Audio-Visual Video Parsing

The Audio-Visual Video Parsing task aims to recognize and temporally localize all events occurring in the audio stream, the visual stream, or both. Capturing accurate event semantics for each audio/visual segment is vital. Prior works directly use the extracted holistic audio and visual features for intra- and cross-modal temporal interactions. However, each segment may contain multiple events, so the holistic features are semantically mixed and can cause semantic interference during intra- or cross-modal interactions: the event semantics of one segment may absorb semantics of unrelated events from other segments. To address this issue, our method begins with a Class-Aware Feature Decoupling (CAFD) module, which explicitly decouples the semantically mixed features into distinct class-wise features, including multiple event-specific features and a dedicated background feature. The decoupled class-wise features enable our model to selectively aggregate useful semantics for each segment from clearly matched classes contained in other segments, preventing semantic interference from irrelevant classes. We further design a Fine-Grained Semantic Enhancement module to encode intra- and cross-modal relations. It comprises a Segment-wise Event Co-occurrence Modeling (SECM) block and a Local-Global Semantic Fusion (LGSF) block. The SECM block exploits inter-class dependencies among concurrent events at the same timestamp with the aid of a new event co-occurrence loss. The LGSF block further enhances the event semantics of each segment by incorporating relevant semantics from more informative global video features. Extensive experiments validate the effectiveness of the proposed modules and loss functions, and our method achieves new state-of-the-art parsing performance.
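The idea of decoupling holistic features and then aggregating per class can be illustrated with a minimal sketch. The abstract does not specify how CAFD decouples features or how the temporal interaction is computed, so everything below is an assumption made for illustration: the gating-based `decouple`, the per-class dot-product attention in `class_wise_aggregate`, and all names and toy values are hypothetical and are not the authors' implementation.

```python
import math


def decouple(segment, class_gates):
    """Split one holistic segment feature into class-wise features.

    Assumption: each class (the event classes plus a final background slot)
    owns a gating vector that is multiplied element-wise with the feature.
    """
    return [[g * x for g, x in zip(gate, segment)] for gate in class_gates]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def class_wise_aggregate(class_feats):
    """Temporal attention applied separately to each class slot.

    class_feats[t][c] is the feature of class c at segment t. Segment t
    attends only to the same class slot c in all segments, so features of
    other classes are never mixed in.
    """
    num_segments = len(class_feats)
    num_classes = len(class_feats[0])
    dim = len(class_feats[0][0])
    enhanced = []
    for t in range(num_segments):
        row = []
        for c in range(num_classes):
            query = class_feats[t][c]
            keys = [class_feats[s][c] for s in range(num_segments)]
            weights = softmax([dot(query, k) / math.sqrt(dim) for k in keys])
            row.append([sum(w * k[d] for w, k in zip(weights, keys))
                        for d in range(dim)])
        enhanced.append(row)
    return enhanced


# Toy input: 3 segments, 4-dim holistic features, 2 event classes + background.
segments = [[1.0, 0.0, 0.5, 0.2], [0.0, 2.0, 0.5, 0.1], [1.0, 1.0, 0.0, 0.3]]
gates = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]]  # hypothetical fixed gates
class_feats = [decouple(seg, gates) for seg in segments]
enhanced = class_wise_aggregate(class_feats)
```

In this toy setup the gates are fixed masks, so the feature of event class 0 stays confined to its own dimension after aggregation. In a trained model the decoupling would presumably be learned, and cross-modal interaction would draw keys from the other modality rather than from the same stream.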
