LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

Video-language (VL) pretraining has achieved remarkable improvements on multiple downstream tasks. However, the current VL pretraining framework is hard to extend to multiple modalities (N modalities, N>=3) beyond vision and language. We thus propose LanguageBind, which takes language as the bind across different modalities because the language modality is well explored and contains rich semantics. Specifically, we freeze the language encoder acquired by VL pretraining, then train encoders for the other modalities with contrastive learning. As a result, all modalities are mapped to a shared feature space, achieving multi-modal semantic alignment. While LanguageBind ensures that we can extend VL modalities to N modalities, we also need a high-quality dataset of aligned data pairs centered on language. We thus propose VIDAL-10M, a dataset of Video, Infrared, Depth, and Audio paired with their corresponding Language. In VIDAL-10M, all videos come from short-video platforms and carry complete semantics rather than being truncated segments of long videos, and the video, depth, infrared, and audio modalities are all aligned to their textual descriptions. After pretraining on VIDAL-10M, we outperform ImageBind by 1.2% R@1 on the MSR-VTT dataset in zero-shot video-text retrieval with only 15% of the parameters, validating the high quality of our dataset. Beyond this, LanguageBind achieves large improvements on zero-shot video, audio, depth, and infrared understanding tasks. For instance, on the LLVIP and NYU-D datasets, LanguageBind outperforms ImageBind-huge by 23.8% and 11.1% top-1 accuracy, respectively. Code address: https://github.com/PKU-YuanGroup/LanguageBind.
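
The language-anchored contrastive objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the symmetric InfoNCE form, the temperature of 0.07, and the function names (`l2_normalize`, `log_softmax`, `symmetric_contrastive_loss`) are assumptions made for illustration, and random NumPy arrays stand in for the outputs of the frozen language encoder and of a trainable modality encoder.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(modality_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss between a batch of modality and text embeddings.

    Row i of modality_emb is assumed to be paired with row i of text_emb.
    In the setting sketched here, text_emb would come from a frozen language
    encoder, so only the modality encoder would receive gradient updates.
    """
    m = l2_normalize(modality_emb)
    t = l2_normalize(text_emb)
    logits = m @ t.T / temperature            # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])          # matching pairs lie on the diagonal
    loss_m2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # modality -> text
    loss_t2m = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> modality
    return 0.5 * (loss_m2t + loss_t2m)


# Hypothetical stand-ins for encoder outputs (batch of 4, feature dimension 8).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))                    # frozen language features
aligned = text_emb + 0.01 * rng.normal(size=(4, 8))   # well-aligned modality
unaligned = rng.normal(size=(4, 8))                   # unrelated modality

loss_aligned = symmetric_contrastive_loss(aligned, text_emb)
loss_unaligned = symmetric_contrastive_loss(unaligned, text_emb)
print(f"aligned: {loss_aligned:.4f}, unaligned: {loss_unaligned:.4f}")
```

Because every modality would be trained against the same fixed text features in this sketch, the modalities end up in a common space without ever being contrasted directly with each other, which is the sense in which language acts as the bind.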
