LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

Video-language (VL) pretraining has achieved remarkable improvements on multiple downstream tasks. However, current VL pretraining frameworks are hard to extend to multiple modalities (N modalities, N>=3) beyond vision and language. We therefore propose LanguageBind, which uses language as the bind across different modalities because the language modality is well explored and contains rich semantics. Specifically, we freeze the language encoder acquired by VL pretraining, then train encoders for the other modalities with contrastive learning. As a result, all modalities are mapped to a shared feature space, achieving multi-modal semantic alignment. While LanguageBind makes it possible to extend VL modalities to N modalities, we also need a high-quality dataset of aligned data pairs centered on language. We therefore propose VIDAL-10M, a dataset of Video, Infrared, Depth, and Audio data with their corresponding Language. In VIDAL-10M, all videos come from short video platforms and have complete semantics, rather than being truncated segments of long videos, and the video, depth, infrared, and audio modalities are all aligned to their textual descriptions. After pretraining on VIDAL-10M, LanguageBind outperforms ImageBind by 1.2% R@1 on the MSR-VTT dataset in zero-shot video-text retrieval with only 15% of the parameters, validating the high quality of our dataset. Beyond this, LanguageBind achieves large improvements on zero-shot video, audio, depth, and infrared understanding tasks. For instance, on the LLVIP and NYU-D datasets, LanguageBind outperforms ImageBind-huge by 23.8% and 11.1% top-1 accuracy, respectively.
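To illustrate the language-centered alignment described above, here is a minimal sketch of a contrastive objective between fixed text embeddings and the embeddings of one other modality. It assumes a symmetric InfoNCE (CLIP-style) loss and a temperature of 0.07; the abstract does not specify either. The function names and the toy vectors are hypothetical. The text embeddings are plain constants here, standing in for the output of the frozen language encoder, so only the modality side would receive updates in a real training loop.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def diagonal_nll(logits):
    """Mean negative log-softmax of the diagonal (matched-pair) entry per row."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_info_nce(text_emb, modality_emb, temperature=0.07):
    """Assumed loss: average of modality-to-text and text-to-modality InfoNCE.

    text_emb stands in for outputs of the frozen language encoder;
    modality_emb stands in for outputs of the encoder being trained.
    Row i of each list is assumed to describe the same sample.
    """
    t = [l2_normalize(v) for v in text_emb]
    m = [l2_normalize(v) for v in modality_emb]
    # Similarity of every modality sample to every text sample.
    logits = [[sum(a * b for a, b in zip(mi, tj)) / temperature for tj in t]
              for mi in m]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-modality view
    return 0.5 * (diagonal_nll(logits) + diagonal_nll(logits_t))

# Toy batch: two captions and two (hypothetical) depth or audio embeddings.
text = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.2, 0.8]]   # row i matches caption i
shuffled = [aligned[1], aligned[0]]  # pairs deliberately mismatched

loss_aligned = symmetric_info_nce(text, aligned)
loss_shuffled = symmetric_info_nce(text, shuffled)
```

In this toy batch the correctly paired embeddings yield a lower loss than the mismatched ones, which is the signal that would pull each modality encoder toward the shared, language-anchored feature space.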
