Audio-Visual Event Localization (AVEL) is the task of temporally localizing and classifying \emph{audio-visual events}, i.e., events that are simultaneously visible and audible in a video. In this paper, we address AVEL in a weakly-supervised setting, where only video-level event labels (the presence or absence of each event, but not its location in time) are available as supervision for training. Our idea is to use a base model to estimate labels on the training data at a finer temporal resolution than the video level, and then to re-train the model with these labels. Specifically, we determine the subset of labels for each \emph{slice} of frames in a training video by (i) replacing the frames outside the slice with those from a second video that has no overlap in video-level labels, and (ii) feeding this synthetic video into the base model to extract labels for just the slice in question. To handle the out-of-distribution nature of our synthetic videos, we propose an auxiliary objective for the base model that induces more reliable predictions of the localized event labels. Our three-stage pipeline outperforms several existing AVEL methods with no architectural changes and also improves performance on a related weakly-supervised task.
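The slice-level label estimation in steps (i) and (ii) can be illustrated with a minimal sketch. It is not the authors' implementation: the names `make_synthetic` and `slice_labels`, the single `(T, D)` feature array per video (in practice the audio and visual streams would both be replaced in the same way), the callable `base_model` returning per-frame class probabilities of shape `(T, C)`, the mean pooling over the slice, and the `threshold` value are all assumptions made for illustration.

```python
import numpy as np


def make_synthetic(video_a, video_b, start, end):
    """Keep frames [start, end) of video_a; take every other frame from video_b."""
    if video_a.shape != video_b.shape:
        raise ValueError("this sketch assumes both videos have the same shape")
    synthetic = video_b.copy()
    synthetic[start:end] = video_a[start:end]
    return synthetic


def slice_labels(base_model, video_a, labels_a, video_b, labels_b,
                 start, end, threshold=0.5):
    """Estimate which video-level labels of video_a are active in [start, end).

    base_model is a hypothetical callable mapping a (T, D) array to (T, C)
    per-frame class probabilities. labels_a and labels_b are length-C
    video-level label vectors.
    """
    labels_a = np.asarray(labels_a, dtype=bool)
    labels_b = np.asarray(labels_b, dtype=bool)
    # Step (i) requires a second video with no overlapping video-level labels.
    if np.any(np.logical_and(labels_a, labels_b)):
        raise ValueError("the two videos must not share any video-level label")
    synthetic = make_synthetic(video_a, video_b, start, end)
    # Step (ii): run the base model and read off predictions for the slice only.
    probs = np.asarray(base_model(synthetic))
    slice_probs = probs[start:end].mean(axis=0)  # assumed pooling choice
    # Only labels present at the video level of video_a can belong to the slice.
    return np.logical_and(slice_probs >= threshold, labels_a)
```

Because the frames outside the slice come from a video with disjoint labels, any label of the first video that the model still predicts must be supported by the slice itself. The resulting slice-level labels would then serve as targets when the model is re-trained.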