We present Attend-Fusion, a novel and efficient approach to audio-visual fusion for video classification. Our method addresses the challenge of exploiting both the audio and visual modalities while keeping the model architecture compact. Through extensive experiments on the YouTube-8M dataset, we demonstrate that Attend-Fusion achieves competitive performance at significantly lower model complexity than larger baseline models.