Video-language (VL) pretraining has achieved remarkable improvements on multiple downstream tasks. However, current VL pretraining frameworks are hard to extend beyond vision and language to multiple modalities (N modalities, N ≥ 3). We therefore propose LanguageBind, which uses language as the bind across modalities because the language modality is well explored and contains rich semantics. Specifically, we freeze the language encoder acquired by VL pretraining and then train encoders for the other modalities with contrastive learning. As a result, all modalities are mapped to a shared feature space, achieving multi-modal semantic alignment. While LanguageBind makes it possible to extend VL modalities to N modalities, we also need a high-quality dataset of aligned data pairs centered on language. We thus propose VIDAL-10M, a dataset named for its Video, Infrared, Depth, and Audio modalities and their corresponding Language. In VIDAL-10M, all videos come from short-video platforms and have complete semantics rather than being truncated segments of long videos, and the video, depth, infrared, and audio modalities are all aligned to their textual descriptions. LanguageBind achieves superior performance on 15 benchmarks covering video, audio, depth, and infrared. Moreover, multiple experiments provide evidence that LanguageBind is effective at achieving indirect alignment and complementarity among diverse modalities. Code: https://github.com/PKU-YuanGroup/LanguageBind
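To make the training recipe concrete, the following is a minimal sketch of a language-anchored contrastive objective, written in plain Python so that it runs without dependencies. It assumes a symmetric InfoNCE-style loss and a temperature of 0.07; the abstract only says "contrastive learning", so both choices, along with the function names `l2_normalize`, `cross_entropy_diag`, and `contrastive_loss`, are illustrative assumptions rather than the released implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(modality_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    text_emb stands in for the output of the frozen language encoder and is
    treated as a constant; in real training only the encoder that produces
    modality_emb (video, audio, depth, or infrared) would be updated.
    The temperature value is an assumption, not taken from the abstract.
    """
    m = [l2_normalize(v) for v in modality_emb]
    t = [l2_normalize(v) for v in text_emb]
    # Cosine similarity between every modality sample and every caption.
    logits = [
        [sum(a * b for a, b in zip(mi, tj)) / temperature for tj in t]
        for mi in m
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-modality direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))
```

Because every non-language encoder is trained against the same fixed text embeddings, two modalities that are never paired directly (for example depth and audio) still end up in one shared space, which is the indirect alignment the abstract refers to. In this sketch, correctly paired embeddings yield a loss near zero, while shuffled pairs yield a much larger one.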