We focus on the weakly-supervised audio-visual video parsing (AVVP) task, which aims to identify and locate all events in the audio and visual modalities. Previous works concentrate only on denoising video-level labels across modalities and overlook segment-level label noise, which arises because adjacent video segments (i.e., 1-second video clips) may contain different events. Recognizing the events in a segment is challenging, however, because its label could be any combination of the events that occur in the video. To address this issue, we tackle AVVP from a language perspective, since language can freely describe how various events appear in each segment, beyond fixed labels. Specifically, we design language prompts that describe all cases of event appearance for each video. We then calculate the similarity between the language prompts and each segment, and take the events of the most similar prompt as the segment-level label. In addition, to deal with mislabeled segments, we perform dynamic re-weighting on unreliable segments to adjust their labels. Experiments show that our simple yet effective approach outperforms state-of-the-art methods by a large margin.
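
The prompt-matching step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt template, the function names (`event_combinations`, `build_prompt`, `segment_pseudo_labels`), and the toy 3-D embeddings are all assumptions made for illustration. In practice the segment and prompt features would come from pretrained audio/visual and text encoders, which the abstract does not specify. The dynamic re-weighting step is left out because the abstract gives no details on it.

```python
from itertools import combinations

import numpy as np


def event_combinations(video_events):
    """Enumerate every subset of the video-level events, including the empty set."""
    events = sorted(video_events)
    return [
        frozenset(c)
        for r in range(len(events) + 1)
        for c in combinations(events, r)
    ]


def build_prompt(combo):
    """Hypothetical prompt template; the wording is an assumption, not from the source."""
    if not combo:
        return "a segment with no event"
    return "a segment of " + " and ".join(sorted(combo))


def segment_pseudo_labels(segment_feats, prompt_feats, combos):
    """Give each segment the event set of its most similar prompt (cosine similarity)."""
    seg = segment_feats / np.linalg.norm(segment_feats, axis=1, keepdims=True)
    prm = prompt_feats / np.linalg.norm(prompt_feats, axis=1, keepdims=True)
    sim = seg @ prm.T  # shape: (num_segments, num_prompts)
    best = sim.argmax(axis=1)
    return [combos[i] for i in best], sim


# Toy example: a video weakly labeled with two events.
combos = event_combinations({"dog", "speech"})
prompts = [build_prompt(c) for c in combos]

# Stand-in embeddings (assumed); rows follow the order of `combos`:
# no event, dog, speech, dog and speech.
prompt_feats = np.array([
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
])
# Stand-in features for three 1-second segments (assumed).
segment_feats = np.array([
    [0.90, 0.10, 0.00],
    [0.70, 0.80, 0.10],
    [0.05, 0.00, 1.00],
])

labels, sim = segment_pseudo_labels(segment_feats, prompt_feats, combos)
for t, label in enumerate(labels):
    print(t, sorted(label))
```

Including the empty subset lets a segment be labeled as containing none of the video-level events. The similarity matrix is returned alongside the labels so that a later step could use it to judge how reliable each segment label is.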