This report proposes an improved method for the Temporal Sound Localisation (TSL) task, which localizes and classifies the sound events occurring in a video according to a predefined set of sound classes. The winning solution from last year's first competition approached TSL by fusing the audio and video modalities with equal weight. Because the TSL task aims to localize sound events, we conducted experiments that demonstrate the superiority of audio features (Section 3). Based on these findings, to enhance the audio modality features, we employ various models to extract audio features, such as the InterVideo, CaVMAE, and VideoMAE models. Our approach ranks first in the final test with a score of 0.4925.