Joint-embedding self-supervised learning (SSL) commonly relies on transformations such as data augmentation and masking to learn visual representations, typically by enforcing invariance or equivariance to these transformations across two views of an image. This dominant two-view paradigm in SSL often limits the flexibility of learned representations for downstream adaptation by creating performance trade-offs between high-level invariance-demanding tasks such as image classification and more fine-grained equivariance-related tasks. In this work, we propose \emph{seq-JEPA}, a world modeling framework that introduces architectural inductive biases into joint-embedding predictive architectures to resolve this trade-off. Without relying on dual equivariance predictors or loss terms, seq-JEPA simultaneously learns two architecturally separate representations for equivariance- and invariance-demanding tasks. To do so, our model processes short sequences of different views (observations) of an input. Each encoded view is concatenated with an embedding of the relative transformation (action) that produces the next observation in the sequence. These view-action pairs are passed through a transformer encoder that outputs an aggregate representation. A predictor head then conditions this aggregate representation on the upcoming action to predict the representation of the next observation. Empirically, seq-JEPA demonstrates strong performance on both equivariance- and invariance-demanding downstream tasks without sacrificing one for the other. Furthermore, it excels at tasks that inherently require aggregating a sequence of observations, such as path integration across actions and predictive learning across eye movements.
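To make the pipeline concrete, the following is a minimal PyTorch sketch of the forward pass described above. All names (`SeqJEPA`, `view_encoder`, `action_embed`), the dimensions, the mean-pooled aggregate, and the stop-gradient MSE target are illustrative assumptions for this sketch, not the authors' reference implementation.

```python
# Minimal sketch of a seq-JEPA-style forward pass (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeqJEPA(nn.Module):
    def __init__(self, enc_dim=256, act_dim=8, act_emb=64, n_heads=4, n_layers=2):
        super().__init__()
        # Per-view encoder (a small CNN here; any backbone could stand in).
        self.view_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, enc_dim),
        )
        # Embeds the relative transformation (action) between consecutive views.
        self.action_embed = nn.Linear(act_dim, act_emb)
        d_model = enc_dim + act_emb  # each token is an encoded view concatenated with its action
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.aggregator = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Predictor maps (aggregate representation, upcoming action) -> predicted next-view code.
        self.predictor = nn.Sequential(
            nn.Linear(d_model + act_emb, enc_dim), nn.ReLU(),
            nn.Linear(enc_dim, enc_dim),
        )

    def forward(self, views, actions, next_action, next_view):
        # views: (B, T, 3, H, W); actions[t] produces observation t+1: (B, T, act_dim)
        B, T = views.shape[:2]
        z = self.view_encoder(views.flatten(0, 1)).view(B, T, -1)    # per-view (equivariant) codes
        tokens = torch.cat([z, self.action_embed(actions)], dim=-1)  # view-action pairs
        agg = self.aggregator(tokens).mean(dim=1)                    # aggregate (invariant) representation
        pred = self.predictor(torch.cat([agg, self.action_embed(next_action)], dim=-1))
        with torch.no_grad():                                        # stop-gradient target, a common JEPA choice
            target = self.view_encoder(next_view)
        return F.mse_loss(pred, target)                              # simple regression stand-in objective

model = SeqJEPA()
loss = model(
    torch.randn(2, 4, 3, 32, 32),  # sequence of 4 views (observations)
    torch.randn(2, 4, 8),          # actions paired with each view
    torch.randn(2, 8),             # upcoming action conditioning the predictor
    torch.randn(2, 3, 32, 32),     # next observation (prediction target)
)
```

The two architecturally separate representations fall out of this layout: the per-view codes `z` carry transformation-specific (equivariant) information, while the transformer's aggregate `agg` can discard it across the sequence, serving invariance-demanding tasks.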