Audio-visual video highlight detection aims to automatically identify the most salient moments in a video by leveraging both visual and auditory cues. Existing models, however, often underutilize the audio modality: they focus on high-level semantic features and fail to exploit the rich, dynamic characteristics of sound. To address this limitation, we propose Dual-Pathway Audio Encoders for Video Highlight Detection (DAViHD), a novel framework whose audio encoder combines a semantic pathway for content understanding with a dynamic pathway that captures spectro-temporal dynamics. The semantic pathway extracts high-level information by identifying the content of the audio, such as speech, music, or specific sound events. The dynamic pathway employs a frequency-adaptive mechanism that evolves over time to jointly model spectral and temporal dynamics, enabling it to identify transient acoustic events through salient spectral bands and rapid energy changes. We integrate this audio encoder into a full audio-visual framework and achieve new state-of-the-art performance on the large-scale MrHiSum benchmark. Our results demonstrate that a sophisticated, dual-faceted audio representation is key to advancing highlight detection.
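To make the idea of a dynamic pathway concrete, the following is a minimal illustrative sketch and not the DAViHD architecture, whose layers and training procedure are not specified in this abstract. It assumes a magnitude spectrogram given as a list of frames, swaps the learned frequency-adaptive mechanism for a simple per-frame softmax over band log-energies, and uses positive spectral flux as a stand-in for "rapid energy changes". The function names (`dynamic_saliency`, `fuse`) and the mixing weight `alpha` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_saliency(spectrogram):
    """Score each frame by frequency-weighted positive energy change.

    spectrogram: list of frames, each a list of non-negative band magnitudes.
    The band weights are recomputed for every frame from that frame's
    log-energies, so the emphasis on frequency bands shifts over time.
    This is a hand-crafted stand-in for a learned mechanism (assumption).
    """
    energies = [[m * m for m in frame] for frame in spectrogram]
    scores = [0.0]  # the first frame has no predecessor to compare with
    for prev, cur in zip(energies, energies[1:]):
        weights = softmax([math.log1p(e) for e in cur])
        # Positive spectral flux: only energy increases count as onsets.
        flux = [max(c - p, 0.0) for p, c in zip(prev, cur)]
        scores.append(sum(w * f for w, f in zip(weights, flux)))
    return scores


def fuse(semantic_scores, dynamic_scores, alpha=0.5):
    """Blend the two pathway scores with a hypothetical mixing weight alpha.

    Dynamic scores are rescaled by their maximum so both inputs are
    roughly on a [0, 1] scale before mixing.
    """
    if len(semantic_scores) != len(dynamic_scores):
        raise ValueError("pathway scores must have the same length")
    peak = max(dynamic_scores) or 1.0  # avoid division by zero on silence
    return [
        alpha * s + (1.0 - alpha) * (d / peak)
        for s, d in zip(semantic_scores, dynamic_scores)
    ]
```

In this sketch, a sudden burst in one frequency band raises the dynamic score of the frame where it begins, even when a per-frame semantic score (here supplied by the caller, for example from a pretrained audio tagger) changes little. A learned model would replace both the softmax weighting and the linear fusion.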