Over the past several years, the synchronization between audio and visual signals has been leveraged to learn richer audio-visual representations. Aided by the wide availability of unlabeled videos, many unsupervised training frameworks have demonstrated impressive results on various downstream audio and video tasks. Recently, Masked Audio-Video Learners (MAViL) has emerged as a state-of-the-art audio-video pre-training framework. MAViL couples contrastive learning with masked autoencoding to jointly reconstruct audio spectrograms and video frames by fusing information from both modalities. In this paper, we study the potential synergy between diffusion models and MAViL, seeking to derive mutual benefits from these two frameworks. Incorporating diffusion into MAViL, combined with several training-efficiency techniques, including a masking ratio curriculum and adaptive batch sizing, yields a 32% reduction in pre-training floating-point operations (FLOPs) and an 18% decrease in pre-training wall-clock time. Crucially, this improved efficiency does not compromise the model's performance on downstream audio-classification tasks relative to MAViL.