We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance with 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA's predicted embeddings into text. We show that VL-JEPA natively supports selective decoding, which reduces the number of decoding operations by 2.85x while maintaining performance comparable to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. Across eight video classification and eight video retrieval datasets, VL-JEPA's average performance surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets: GQA, TallyQA, POPE, and POPEv2, despite having only 1.6B parameters.