In the field of audio-visual learning, most research focuses exclusively on short videos. This paper addresses the more practical Dense Audio-Visual Event Localization (DAVEL) task, advancing audio-visual scene understanding for longer, untrimmed videos. DAVEL seeks to identify and temporally localize all events occurring simultaneously in both the audio and visual streams. Typically, each video contains dense events of multiple classes, which may overlap on the timeline and vary in duration. These characteristics make it crucial to effectively exploit audio-visual relations and temporal features encoded at multiple granularities. To address these challenges, we introduce a novel CCNet, comprising two core modules: the Cross-Modal Consistency Collaboration (CMCC) and the Multi-Temporal Granularity Collaboration (MTGC). Specifically, the CMCC module contains two branches: a cross-modal interaction branch and a temporal consistency-gated branch. The former aggregates consistent event semantics across modalities by encoding audio-visual relations, while the latter steers one modality's attention toward the event-relevant temporal regions identified in the other modality. The MTGC module includes a coarse-to-fine collaboration block and a fine-to-coarse collaboration block, providing bidirectional support between coarse- and fine-grained temporal features. Extensive experiments on the UnAV-100 dataset validate our module design and yield new state-of-the-art performance in dense audio-visual event localization. The code is available at https://github.com/zzhhfut/CCNet-AAAI2025.
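To make the two CMCC branch ideas concrete, the following is a minimal NumPy sketch (not the paper's implementation; all function names, shapes, and the single learned projection `w` are illustrative assumptions). The interaction branch is sketched as scaled dot-product cross-attention from one modality onto the other, and the consistency-gated branch as a per-timestep sigmoid relevance score from one modality that gates the other's timeline.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_interaction(query, key_value):
    """Cross-attention: aggregate event semantics from the other
    modality for every timestep of the query modality."""
    d = query.shape[-1]
    attn = softmax(query @ key_value.T / np.sqrt(d), axis=-1)  # (T, T)
    return attn @ key_value                                    # (T, d)

def temporal_consistency_gate(features, other_modality, w):
    """Gate one modality's timeline by a scalar event-relevance
    score per timestep estimated from the other modality."""
    gate = sigmoid(other_modality @ w)  # (T, 1), in (0, 1)
    return features * gate              # broadcasts to (T, d)

# Toy example: T timesteps, d-dimensional features per modality.
rng = np.random.default_rng(0)
T, d = 8, 16
audio = rng.standard_normal((T, d))
visual = rng.standard_normal((T, d))
w = rng.standard_normal((d, 1))  # hypothetical relevance projection

# Visual features enhanced by both branches (audio plays the
# "other modality" role here; the symmetric direction is analogous).
visual_enhanced = (cross_modal_interaction(visual, audio)
                   + temporal_consistency_gate(visual, audio, w))
print(visual_enhanced.shape)  # (8, 16)
```

The gating direction is symmetric: swapping the roles of `audio` and `visual` produces the audio-side enhancement in the same way.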