Over the past several years, the synchronization between audio and visual signals has been leveraged to learn richer audio-visual representations. Aided by the wide availability of unlabeled videos, many unsupervised training frameworks have demonstrated impressive results on a variety of downstream audio and video tasks. Recently, Masked Audio-Video Learners (MAViL) has emerged as a state-of-the-art audio-video pre-training framework. MAViL couples contrastive learning with masked autoencoding to jointly reconstruct audio spectrograms and video frames by fusing information from both modalities. In this paper, we study the potential synergy between diffusion models and MAViL, seeking to derive mutual benefits from the two frameworks. Incorporating diffusion into MAViL, together with training-efficiency techniques that include a masking ratio curriculum and adaptive batch sizing, yields a 32% reduction in pre-training floating-point operations (FLOPs) and an 18% decrease in pre-training wall-clock time. Crucially, this improved efficiency does not compromise performance on downstream audio-classification tasks relative to MAViL.
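To make the two efficiency techniques named above concrete, the following is a minimal sketch of how a masking ratio curriculum could be paired with adaptive batch sizing. It is an illustration under stated assumptions, not the schedule used in this work: the linear interpolation, the fixed visible-token budget, and the names `masking_ratio`, `adaptive_batch_size`, `token_budget`, and `tokens_per_sample` are all hypothetical.

```python
def masking_ratio(step, total_steps, start, end):
    """Linearly interpolate the masking ratio from `start` to `end`.

    Assumption: a linear schedule. The abstract does not specify the
    shape or direction of the curriculum.
    """
    if total_steps <= 1:
        return end
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + frac * (end - start)


def adaptive_batch_size(ratio, token_budget, tokens_per_sample):
    """Choose a batch size that keeps visible (unmasked) tokens near a budget.

    Assumption: the encoder only processes visible tokens, so a higher
    masking ratio leaves room for more samples per batch.
    """
    visible = max(1, round(tokens_per_sample * (1.0 - ratio)))
    return max(1, token_budget // visible)


if __name__ == "__main__":
    # Hypothetical settings chosen only to show the interaction.
    total_steps = 5
    for step in range(total_steps):
        r = masking_ratio(step, total_steps, start=0.75, end=0.5)
        b = adaptive_batch_size(r, token_budget=4096, tokens_per_sample=512)
        print(f"step={step} masking_ratio={r:.4f} batch_size={b}")
```

Under these assumptions, the batch size shrinks as the masking ratio decreases, so the number of visible tokens processed per step stays roughly constant.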