The Audio-Visual Video Parsing task aims to identify and temporally localize the events that occur in either or both of the audio and visual streams of audible videos. It is often performed in a weakly-supervised manner, where only video-level event labels are provided, \ie, the modalities and the timestamps of the labels are unknown. Due to the lack of densely annotated labels, recent work attempts to leverage pseudo labels to enrich the supervision. A commonly used strategy is to generate pseudo labels by categorizing the known video event labels for each modality. However, such labels are still confined to the video level, and the temporal boundaries of events remain unlabeled. In this paper, we propose a new pseudo label generation strategy that explicitly assigns labels to each video segment by utilizing prior knowledge learned from the open world. Specifically, we exploit large-scale pretrained models, namely CLIP and CLAP, to estimate the events in each video segment and generate segment-level visual and audio pseudo labels, respectively. We then propose a new loss function that exploits these pseudo labels by taking their category-richness and segment-richness into account. A label denoising strategy is also adopted to further improve the visual pseudo labels by flipping them whenever abnormally large forward losses occur. We perform extensive experiments on the LLP dataset, demonstrate the effectiveness of each proposed design, and achieve state-of-the-art video parsing performance on all types of event parsing, \ie, audio events, visual events, and audio-visual events. We also examine the proposed pseudo label generation strategy on the related weakly-supervised audio-visual event localization task, and the experimental results again verify the benefits and generalization of our method.
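The two mechanisms above can be illustrated with a minimal sketch: segment-level pseudo labels obtained by comparing per-segment features against class-name text embeddings (in the actual pipeline these would come from the CLIP/CLAP image, audio, and text encoders), and a denoising step that flips the labels of segments whose forward loss is an outlier. The function names, the softmax-plus-threshold rule, and the mean-plus-$k\sigma$ outlier criterion are all hypothetical simplifications, not the paper's exact formulation.

```python
import numpy as np

def segment_pseudo_labels(seg_feats, text_feats, threshold=0.6):
    # seg_feats: (T, D) per-segment features (e.g. from CLIP/CLAP encoders);
    # text_feats: (C, D) class-name text embeddings. Both are assumptions here.
    seg = seg_feats / np.linalg.norm(seg_feats, axis=-1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    sim = seg @ txt.T                                   # (T, C) cosine similarity
    e = np.exp(sim - sim.max(axis=-1, keepdims=True))   # softmax over classes
    probs = e / e.sum(axis=-1, keepdims=True)
    # Multi-hot pseudo labels per segment; the threshold is a free choice.
    return (probs > threshold).astype(np.float32)

def denoise_flip(labels, losses, k=2.0):
    # Flip the pseudo labels of segments whose per-segment forward loss is
    # abnormally large; mean + k*std is a hypothetical outlier rule standing
    # in for the paper's criterion.
    outlier = losses > losses.mean() + k * losses.std()
    out = labels.copy()
    out[outlier] = 1.0 - out[outlier]
    return out
```

In this sketch, a segment whose loss under the current pseudo labels is far above the batch statistics is treated as mislabeled and has its multi-hot vector inverted, matching the abstract's description of flipping visual pseudo labels on abnormally large forward losses.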