The Audio-Visual Video Parsing (AVVP) task aims to detect and temporally locate events within audio and visual modalities. Multiple events can overlap on the timeline, making identification challenging. While traditional methods usually focus on improving the early audio-visual encoders to embed more effective features, the decoding phase -- crucial for final event classification -- often receives less attention. We aim to advance the decoding phase and improve its interpretability. Specifically, we introduce a new decoding paradigm, \underline{l}abel s\underline{e}m\underline{a}ntic-based \underline{p}rojection (LEAP), that employs the label texts of event categories, each bearing distinct and explicit semantics, to parse potentially overlapping events. LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings. This process, enriched by modeling cross-modal (audio/visual-label) interactions, gradually disentangles event semantics within video segments to refine relevant label embeddings, ensuring a more discriminative and interpretable decoding process. To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function. This function leverages the Intersection over Union of audio and visual events (EIoU) as a metric to calibrate audio-visual similarities at the feature level, accommodating the varied event densities across modalities. Extensive experiments demonstrate the superiority of our method, achieving new state-of-the-art performance on AVVP and also improving results on the related audio-visual event localization task.
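To make the projection idea concrete, below is a minimal sketch of one LEAP-style iteration in PyTorch. It is not the paper's implementation: the dimensions (`d_model`, `num_classes`), the iteration count `num_iters`, and the use of freely learned label embeddings (the paper derives them from label texts) are all assumptions for illustration. The key mechanism shown is iterative cross-modal attention in which label embeddings query segment features and are refined each pass.

```python
import torch
import torch.nn as nn

class LabelSemanticProjection(nn.Module):
    """Sketch of LEAP-style decoding: label embeddings iteratively attend
    to encoded audio/visual segment features and are refined in place."""
    def __init__(self, d_model: int = 256, num_classes: int = 25, num_iters: int = 3):
        super().__init__()
        # One embedding per event category; the paper initializes these from
        # label texts, here they are learned freely (an assumption).
        self.label_embed = nn.Embedding(num_classes, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.num_iters = num_iters

    def forward(self, seg_feats: torch.Tensor) -> torch.Tensor:
        # seg_feats: (B, T, d_model) encoded audio or visual segment features.
        B = seg_feats.size(0)
        labels = self.label_embed.weight.unsqueeze(0).expand(B, -1, -1)  # (B, C, d)
        for _ in range(self.num_iters):
            # Labels act as queries over segments; attention weights indicate
            # which segments carry each event's semantics.
            attended, _ = self.cross_attn(query=labels, key=seg_feats, value=seg_feats)
            labels = self.norm(labels + attended)  # refine label embeddings
        return labels  # refined per-video label embeddings, ready for scoring
```

Because each refined label embedding corresponds to one named event category, a downstream classifier that scores segments against these embeddings yields a decoding that is interpretable by construction, which is the property the abstract emphasizes.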
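The EIoU-based similarity loss can likewise be sketched under stated assumptions: we take `y_a` and `y_v` to be binary per-segment event label matrices of shape (T, C), pool features by mean over time, and penalize the squared gap between feature cosine similarity and EIoU. The paper's exact formulation (pooling, loss form, label granularity) may differ; this only illustrates how EIoU can calibrate cross-modal similarity so that modalities with few shared events are not forced to align.

```python
import torch
import torch.nn.functional as F

def event_iou(y_a: torch.Tensor, y_v: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """EIoU: |audio events ∩ visual events| / |audio events ∪ visual events|.
    y_a, y_v: binary (T, C) event occurrence matrices (an assumed encoding)."""
    inter = (y_a * y_v).sum()
    union = ((y_a + y_v) > 0).float().sum()
    return inter / (union + eps)

def av_semantic_similarity_loss(f_a: torch.Tensor, f_v: torch.Tensor,
                                y_a: torch.Tensor, y_v: torch.Tensor) -> torch.Tensor:
    # f_a, f_v: (T, d) audio/visual segment features. The cosine similarity of
    # the time-pooled features is calibrated toward the EIoU of the event
    # annotations, accommodating differing event densities across modalities.
    sim = F.cosine_similarity(f_a.mean(dim=0), f_v.mean(dim=0), dim=0)
    target = event_iou(y_a.float(), y_v.float())
    return (sim - target) ** 2
```

For example, if the audio track contains two events and the visual track shares only one of them, EIoU is 1/2, so the loss drives the pooled features toward partial rather than full alignment.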