Audio-visual video highlight detection aims to automatically identify the most salient moments in a video by combining visual and auditory cues. Existing models, however, often underutilize the audio modality: they focus on high-level semantic features and fail to exploit the rich, dynamic characteristics of sound. To address this limitation, we propose Dual-Pathway Audio Encoders for Video Highlight Detection (DAViHD), a novel framework whose audio encoder combines a semantic pathway for content understanding with a dynamic pathway that captures spectro-temporal dynamics. The semantic pathway extracts high-level information by identifying what the audio contains, such as speech, music, or specific sound events. The dynamic pathway employs a frequency-adaptive mechanism that evolves over time to jointly model spectral and temporal dynamics, enabling it to identify transient acoustic events through salient spectral bands and rapid energy changes. We integrate this audio encoder into a full audio-visual framework and achieve new state-of-the-art performance on the large-scale Mr.HiSum benchmark. Our results demonstrate that a sophisticated, dual-faceted audio representation is key to advancing highlight detection.
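To make the dual-pathway idea concrete, the following is a minimal toy sketch and not the DAViHD implementation. It assumes the input is a small band-energy spectrogram given as a list of frames. All function names (`dynamic_pathway`, `semantic_pathway`, `dual_pathway_features`) are hypothetical. The semantic pathway is replaced by a simple placeholder where a real system would presumably use a pretrained audio model. The softmax band weighting and the concatenation-based fusion are likewise assumptions, since the abstract does not specify either.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_pathway(spec):
    """Toy dynamic pathway: one saliency score per frame.

    Assumption: band weights are a softmax over the positive
    band-wise energy change, recomputed at every time step.
    """
    scores = [0.0]  # the first frame has no predecessor
    for prev, cur in zip(spec, spec[1:]):
        # Positive energy change per band (a spectral-flux style cue).
        flux = [max(c - p, 0.0) for p, c in zip(prev, cur)]
        # Time-varying, frequency-adaptive weights.
        weights = softmax(flux)
        scores.append(sum(w * f for w, f in zip(weights, flux)))
    return scores

def semantic_pathway(spec):
    """Placeholder for a content encoder: normalised band profile."""
    out = []
    for frame in spec:
        total = sum(frame) or 1.0
        out.append([v / total for v in frame])
    return out

def dual_pathway_features(spec):
    # Assumed fusion: concatenate both pathways frame by frame.
    sem = semantic_pathway(spec)
    dyn = dynamic_pathway(spec)
    return [s + [d] for s, d in zip(sem, dyn)]

# Four frames, three bands; a transient burst hits band 1 at frame 2.
spec = [
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0],
    [1.0, 5.0, 1.0],
    [1.0, 1.0, 1.0],
]
features = dual_pathway_features(spec)
dyn_scores = [f[-1] for f in features]
peak_frame = dyn_scores.index(max(dyn_scores))
```

In this toy input the dynamic score is zero for steady frames and peaks at the frame where the burst begins, while the placeholder semantic features only describe how energy is distributed across bands. This mirrors the division of labour described above: one pathway answers what is in the audio, the other answers when and in which bands it changes abruptly.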