We present EB-JEPA, an open-source library for learning representations and world models with Joint-Embedding Predictive Architectures (JEPAs). JEPAs predict in representation space rather than pixel space, avoiding the pitfalls of generative modeling while capturing semantically meaningful features for downstream tasks. The library provides modular, self-contained implementations that illustrate how representation-learning techniques developed for image-level self-supervised learning transfer to video, where temporal dynamics add complexity, and ultimately to action-conditioned world models, which must additionally predict the effects of control inputs. Each example is designed to train on a single GPU within a few hours, making energy-based self-supervised learning accessible for research and education. On CIFAR-10, we ablate the JEPA components; probing the resulting representations yields 91% accuracy, indicating that the model learns useful features. Extending to video, we include a multi-step prediction example on Moving MNIST that demonstrates how the same principles scale to temporal modeling. Finally, we show how these representations can drive action-conditioned world models, achieving a 97% planning success rate on the Two Rooms navigation task. Comprehensive ablations reveal that each regularization component is critical for preventing representation collapse. Code is available at https://github.com/facebookresearch/eb_jepa.
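To make the core idea concrete, the following is a minimal toy sketch of a JEPA-style objective: a predictor maps a context representation to a target representation, and the loss is measured entirely in representation space, with VICReg-style variance and covariance regularizers of the kind the ablations examine for preventing collapse. All names and the tiny linear/tanh encoders here are hypothetical illustrations, not the library's actual API or architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_IN, D_REP = 64, 32, 8  # batch size, input dim, representation dim

# Toy encoders and predictor (hypothetical; real models are deep networks).
W_ctx = rng.normal(0, 0.1, (D_IN, D_REP))    # context encoder
W_tgt = rng.normal(0, 0.1, (D_IN, D_REP))    # target encoder (an EMA copy in practice)
W_pred = rng.normal(0, 0.1, (D_REP, D_REP))  # predictor in representation space

def jepa_loss(x_ctx, x_tgt, var_coef=1.0, cov_coef=0.1):
    z_ctx = np.tanh(x_ctx @ W_ctx)
    z_tgt = np.tanh(x_tgt @ W_tgt)   # treated as a constant target (stop-gradient)
    z_hat = z_ctx @ W_pred           # predict the target *representation*
    pred = np.mean((z_hat - z_tgt) ** 2)  # no pixel reconstruction anywhere
    # Variance hinge: keep each representation dimension from collapsing to a point.
    std = z_ctx.std(axis=0)
    var = np.mean(np.maximum(0.0, 1.0 - std))
    # Covariance penalty: decorrelate dimensions so they carry distinct information.
    zc = z_ctx - z_ctx.mean(axis=0)
    cov = (zc.T @ zc) / (N - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_pen = np.sum(off_diag ** 2) / D_REP
    return pred + var_coef * var + cov_coef * cov_pen

x = rng.normal(size=(N, D_IN))
x_aug = x + 0.05 * rng.normal(size=x.shape)  # a weak "augmented view"
loss = jepa_loss(x, x_aug)
```

Without the variance and covariance terms, a trivial encoder that maps every input to the same vector would drive the prediction loss to zero, which is exactly the collapse mode the regularizers guard against.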