LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

Video-language (VL) pretraining has achieved remarkable improvements on multiple downstream tasks. However, current VL pretraining frameworks are hard to extend to multiple modalities (N modalities, N>=3) beyond vision and language. We thus propose LanguageBind, which takes language as the bind across different modalities because the language modality is well explored and contains rich semantics. Specifically, we freeze the language encoder acquired by VL pretraining, then train encoders for the other modalities with contrastive learning. As a result, all modalities are mapped to a shared feature space, achieving multi-modal semantic alignment. While LanguageBind ensures that we can extend VL modalities to N modalities, we also need a high-quality dataset with alignment data pairs centered on language. We thus propose VIDAL-10M, a dataset of Video, Infrared, Depth, and Audio with their corresponding Language. In VIDAL-10M, all videos are from short-video platforms and carry complete semantics rather than being truncated segments of long videos, and the video, depth, infrared, and audio modalities are all aligned to their textual descriptions. After pretraining on VIDAL-10M, we outperform ImageBind by 5.8% R@1 on the MSR-VTT dataset in the zero-shot video-text retrieval task with only 15% of its parameters. Beyond this, LanguageBind also achieves large improvements on zero-shot video, audio, depth, and infrared understanding tasks. For instance, LanguageBind surpasses InternVideo by 1.9% on MSR-VTT, 8.8% on MSVD, 6.3% on DiDeMo, and 4.4% on ActivityNet. On the LLVIP and NYU-D datasets, LanguageBind outperforms ImageBind by 23.8% and 11.1% in top-1 accuracy. Code: https://github.com/PKU-YuanGroup/LanguageBind.
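
The language-centered contrastive objective and the R@1 retrieval metric mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a CLIP-style symmetric InfoNCE loss, and the function names, the temperature of 0.07, and the toy embeddings are all hypothetical. The text embeddings are passed in as fixed inputs, standing in for the output of the frozen language encoder; in real training only the other modality's encoder would receive gradients.

```python
import math

def l2_normalize(v):
    # Scale a non-zero vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(queries, keys):
    # Cosine similarity of every query embedding against every key embedding.
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    return [[sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]

def _cross_entropy_diag(logits):
    # Mean cross-entropy where the correct "class" for row i is column i,
    # i.e. the i-th sample is paired with the i-th text.
    total = 0.0
    for i, row in enumerate(logits):
        mx = max(row)
        log_sum_exp = mx + math.log(sum(math.exp(x - mx) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(modal_embs, text_embs, temperature=0.07):
    # Symmetric InfoNCE: modality-to-text plus text-to-modality, averaged.
    # temperature=0.07 is an assumed value, not taken from the paper.
    sim = similarity_matrix(modal_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))

def recall_at_1(modal_embs, text_embs):
    # Text-to-modality retrieval: fraction of texts whose top-ranked
    # sample is the paired one (same index).
    sim = similarity_matrix(text_embs, modal_embs)
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(sim)

# Toy data: two fixed text embeddings and two candidate sets of modality embeddings.
text = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each sample close to its own text
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each sample close to the wrong text

loss_aligned = contrastive_loss(aligned, text)
loss_swapped = contrastive_loss(swapped, text)
```

In this toy setting the aligned embeddings yield a lower loss than the swapped ones, and `recall_at_1` separates the two cases in the same way, which is the behavior the training objective is meant to encourage as each new modality is pulled toward the shared language space.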
