Weakly supervised audio-visual video parsing (AVVP) methods aim to detect audible-only, visible-only, and audible-visible events using only video-level labels. Existing approaches tackle this task by leveraging unimodal and cross-modal contexts. However, we argue that while cross-modal learning is beneficial for detecting audible-visible events, in the weakly supervised scenario it harms the detection of unaligned audible-only or visible-only events by introducing irrelevant information from the other modality. In this paper, we propose CoLeaF, a novel learning framework that optimizes the integration of cross-modal context in the embedding space, such that the network explicitly learns to combine cross-modal information for audible-visible events while filtering it out for unaligned events. Additionally, videos often involve complex class relationships, and modelling them improves performance but introduces extra computational cost into the network. Our framework is designed to leverage cross-class relationships during training without incurring additional computation at inference. Furthermore, we propose new metrics to better evaluate a method's capabilities in performing AVVP. Our extensive experiments demonstrate that CoLeaF significantly improves the state-of-the-art results by an average of 1.9% and 2.4% F-score on the LLP and UnAV-100 datasets, respectively.
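To make the task setting concrete, the following is a minimal sketch of how the three event types relate to per-modality annotations, and of what remains once only a video-level label is kept. It is not part of CoLeaF; the function names, the binary per-segment flags, and the rule used to derive the video-level label are assumptions introduced purely for illustration.

```python
from typing import Dict, List


def split_event_types(audio: List[int], visual: List[int]) -> Dict[str, List[int]]:
    """Split per-segment presence flags (0/1) of one event class into the
    three event types named in the abstract. Hypothetical helper."""
    if len(audio) != len(visual):
        raise ValueError("audio and visual must cover the same segments")
    pairs = list(zip(audio, visual))
    return {
        # Event is heard but not seen in the segment (unaligned).
        "audible_only": [int(bool(a) and not v) for a, v in pairs],
        # Event is seen but not heard in the segment (unaligned).
        "visible_only": [int(bool(v) and not a) for a, v in pairs],
        # Event is both heard and seen in the segment (aligned).
        "audible_visible": [int(bool(a) and bool(v)) for a, v in pairs],
    }


def video_level_label(audio: List[int], visual: List[int]) -> int:
    """Assumed weak label: 1 if the event occurs anywhere in either modality."""
    return int(any(audio) or any(visual))


# Toy example with four segments for a single event class.
audio_flags = [1, 1, 0, 0]
visual_flags = [0, 1, 1, 0]
types = split_event_types(audio_flags, visual_flags)
weak_label = video_level_label(audio_flags, visual_flags)
```

In the weakly supervised setting only something like `weak_label` is available during training, whereas the evaluation targets the segment-level, per-modality structure held in `types`. This gap is why cross-modal context can mislead the network on unaligned events.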