Audio-Visual Event Localization (AVEL) is the task of temporally localizing and classifying \emph{audio-visual events}, i.e., events that are simultaneously visible and audible in a video. In this paper, we address AVEL in a weakly-supervised setting, where only video-level event labels (the presence or absence of each event, but not its location in time) are available as supervision for training. Our idea is to use a base model to estimate labels on the training data at a finer temporal resolution than the video level, and then to re-train the model with these labels. Specifically, we determine the subset of labels for each \emph{slice} of frames in a training video by (i) replacing the frames outside the slice with frames from a second video that has no overlap in video-level labels, and (ii) feeding this synthetic video into the base model to extract labels for just the slice in question. To handle the out-of-distribution nature of our synthetic videos, we propose an auxiliary objective for the base model that induces more reliable predictions of the localized event labels. Our three-stage pipeline outperforms several existing AVEL methods with no architectural changes and also improves performance on a related weakly-supervised task.
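The slice-labeling steps (i) and (ii) described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the implementation, so everything here is an assumption made for illustration: the function names `make_synthetic` and `estimate_slice_labels`, the use of a single per-segment feature array in place of separate audio and visual streams, a `base_model` callable that returns video-level class probabilities, and the `threshold` value are all hypothetical.

```python
import numpy as np


def make_synthetic(feats_a, feats_b, start, end):
    """Keep segments [start, end) of video A and fill every other segment from video B.

    feats_a, feats_b: arrays of shape (num_segments, feature_dim); the equal
    shape is an assumption of this sketch.
    """
    if feats_a.shape != feats_b.shape:
        raise ValueError("both videos must have the same number of segments and features")
    synthetic = feats_b.copy()
    synthetic[start:end] = feats_a[start:end]
    return synthetic


def estimate_slice_labels(base_model, feats_a, labels_a, feats_b, labels_b,
                          start, end, threshold=0.5):
    """Estimate which of video A's video-level labels occur inside [start, end).

    labels_a, labels_b: multi-hot vectors of shape (num_classes,).
    base_model: hypothetical callable mapping a feature array to video-level
    class probabilities of shape (num_classes,).
    """
    labels_a = np.asarray(labels_a, dtype=bool)
    labels_b = np.asarray(labels_b, dtype=bool)
    # Step (i) requires a second video with no overlap in video-level labels.
    if np.any(labels_a & labels_b):
        raise ValueError("the second video must share no video-level labels with the first")
    synthetic = make_synthetic(feats_a, feats_b, start, end)
    # Step (ii): query the base model on the synthetic video.
    probs = np.asarray(base_model(synthetic))
    # Because video B shares no labels with video A, any of A's labels that is
    # predicted for the synthetic video is attributed to the retained slice.
    return (probs >= threshold) & labels_a
```

Under these assumptions, calling `estimate_slice_labels` once per slice of a training video yields slice-level pseudo-labels that could then be used for re-training. The auxiliary objective mentioned in the abstract is not modeled here.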