Audio-Visual Video Parsing is the task of predicting the events that occur in each video segment for each modality. It is usually performed in a weakly supervised setting, where only video-level event labels are provided, i.e., the modality and timestamps of each label are unknown. Because densely annotated labels are unavailable, recent work leverages pseudo labels to enrich the supervision. A common strategy is to generate pseudo labels by assigning the known event labels to each modality. However, these labels are still limited to the video level, and the temporal boundaries of the events remain unlabeled. In this paper, we propose a new pseudo label generation strategy that explicitly assigns labels to each video segment by utilizing prior knowledge learned from the open world. Specifically, we exploit the CLIP model to estimate the events in each video segment from the visual modality, which yields segment-level pseudo labels. We propose a new loss function that regularizes these labels by taking into account their category-richness and segment-richness. We also adopt a label denoising strategy that improves the pseudo labels by flipping them whenever the forward binary cross-entropy loss is high. Extensive experiments on the LLP dataset demonstrate that our method generates high-quality segment-level pseudo labels with the help of the proposed loss and the label denoising strategy. Our method achieves state-of-the-art audio-visual video parsing performance.
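The two labeling steps described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the CLIP similarity between each segment and each event class has already been computed and scaled to [0, 1], and the function names, the similarity threshold, the masking by video-level labels, and the loss threshold are all hypothetical choices made for illustration.

```python
import numpy as np


def segment_pseudo_labels(similarity, video_labels, threshold=0.5):
    """Turn per-segment class scores into binary segment-level pseudo labels.

    similarity:   (T, C) array, assumed to hold precomputed CLIP
                  segment-to-class scores in [0, 1] (assumption).
    video_labels: (C,) binary array of known video-level event labels.
    threshold:    hypothetical cut-off, not taken from the paper.
    """
    similarity = np.asarray(similarity, dtype=float)
    video_labels = np.asarray(video_labels, dtype=float)
    candidate = (similarity >= threshold).astype(float)
    # Assumption: keep only classes that appear in the video-level labels.
    return candidate * video_labels[None, :]


def binary_cross_entropy(pred, target, eps=1e-7):
    """Element-wise binary cross-entropy between predictions and labels."""
    pred = np.clip(np.asarray(pred, dtype=float), eps, 1.0 - eps)
    target = np.asarray(target, dtype=float)
    return -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))


def flip_noisy_labels(pred, pseudo, loss_threshold=2.0):
    """Flip pseudo labels whose forward BCE loss exceeds a threshold.

    loss_threshold is a hypothetical value; the abstract only states
    that labels are flipped when the loss is high.
    """
    pseudo = np.asarray(pseudo, dtype=float)
    loss = binary_cross_entropy(pred, pseudo)
    return np.where(loss > loss_threshold, 1.0 - pseudo, pseudo)
```

In this sketch a segment's label is flipped only when the model's prediction strongly disagrees with it, so confident predictions correct the pseudo label while uncertain ones leave it unchanged.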