We propose Fast Language-Audio Pre-training (FLAP), a self-supervised approach that efficiently and effectively learns aligned audio and language representations through masking, contrastive learning and reconstruction. For efficiency, FLAP randomly drops audio spectrogram tokens, focusing solely on the remaining ones for self-supervision. Through inter-modal contrastive learning, FLAP learns to align paired audio and text representations in a shared latent space. Notably, FLAP leverages multiple augmented views via masking for inter-modal contrast and learns to reconstruct the masked portion of audio tokens. Moreover, FLAP leverages large language models (LLMs) to augment the text inputs, contributing to improved performance. These approaches lead to more robust and informative audio-text representations, enabling FLAP to achieve state-of-the-art (SoTA) performance on audio-text retrieval tasks on AudioCaps (achieving 53.0% R@1) and Clotho (achieving 25.5% R@1).
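The two core steps named in the abstract, random dropping of audio spectrogram tokens and inter-modal contrastive alignment, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `keep_ratio` parameter, the temperature of 0.07, and the use of mean pooling in place of a real audio encoder are all assumptions made for illustration, and the reconstruction objective and LLM-based text augmentation are omitted.

```python
import math
import random


def drop_tokens(tokens, keep_ratio, rng):
    """Randomly keep a subset of spectrogram tokens (hypothetical helper).

    Returns the kept tokens and their original indices, in order.
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    kept_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept_idx], kept_idx


def mean_pool(tokens):
    """Stand-in for an encoder: average the token vectors into one embedding."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    The temperature value is an assumption, not taken from the abstract.
    """
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    # Cosine similarity between every audio and every text embedding.
    sim = [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
           for ai in a]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-audio direction
    return 0.5 * (cross_entropy_diagonal(sim) + cross_entropy_diagonal(sim_t))
```

In this sketch, calling `drop_tokens` more than once on the same clip with different random states yields several masked views of it, each of which could be pooled and contrasted against the paired text embedding. The efficiency gain described in the abstract comes from the encoder processing only the kept tokens.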