Existing audio-visual event localization (AVE) methods handle only manually trimmed videos, each containing a single event instance. This setting is unrealistic, since natural videos often contain numerous audio-visual events of different categories. To better suit real-life applications, this paper focuses on the task of densely localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video. The problem is challenging because it requires fine-grained understanding of the audio-visual scene and its context. To tackle it, we introduce the first Untrimmed Audio-Visual (UnAV-100) dataset, which contains 10K untrimmed videos with over 30K audio-visual events. Each video has 2.8 audio-visual events on average, and the events are usually related to each other and may co-occur, as in real-life scenes. Next, we formulate the task using a new learning-based framework that can fully integrate the audio and visual modalities to localize audio-visual events of various lengths and capture the dependencies between them in a single pass. Extensive experiments demonstrate the effectiveness of our method, as well as the significance of multi-scale cross-modal perception and dependency modeling for this task.