The joint-embedding predictive architecture (JEPA) has recently shown impressive results in extracting visual representations from unlabeled imagery under a masking strategy. However, we reveal its disadvantages, notably its insufficient understanding of local semantics. This deficiency originates from masked modeling in the embedding space, which reduces discriminative power and can even lead to the neglect of critical local semantics. To bridge this gap, we introduce DMT-JEPA, a novel masked modeling objective rooted in JEPA and specifically designed to generate discriminative latent targets from neighboring information. Our key idea is simple: we treat a set of semantically similar neighboring patches as the target of a masked patch. Specifically, DMT-JEPA (a) computes feature similarities between each masked patch and its neighboring patches to select those with semantically meaningful relations, and (b) employs lightweight cross-attention heads to aggregate the features of neighboring patches as the masked targets. Consequently, DMT-JEPA demonstrates strong discriminative power, offering benefits across a diverse spectrum of downstream tasks. Through extensive experiments, we demonstrate its effectiveness across various visual benchmarks, including ImageNet-1K image classification, ADE20K semantic segmentation, and COCO object detection. Code is available at: \url{https://github.com/DMTJEPA/DMTJEPA}.
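Steps (a) and (b) can be illustrated with a minimal NumPy sketch. It is not the released implementation. The function names, the square neighborhood of `radius` patches, the `top_k` selection rule, and the parameter-free single-head attention (no learned projections) are all assumptions made for illustration; the abstract states only that feature similarities select neighbors and that lightweight cross-attention heads aggregate them.

```python
import numpy as np

def neighbor_indices(idx, grid_h, grid_w, radius=1):
    """Flat indices of patches within `radius` of patch `idx` on the grid,
    excluding the patch itself. The square window is an assumption."""
    r, c = divmod(idx, grid_w)
    out = []
    for rr in range(max(0, r - radius), min(grid_h, r + radius + 1)):
        for cc in range(max(0, c - radius), min(grid_w, c + radius + 1)):
            if (rr, cc) != (r, c):
                out.append(rr * grid_w + cc)
    return np.array(out)

def neighbor_target(feats, idx, grid_h, grid_w, radius=1, top_k=4):
    """Build a target for masked patch `idx` from its most similar neighbors.

    feats: (grid_h * grid_w, dim) array of patch features, assumed here to
    come from some encoder. Returns (target, chosen_indices, weights).
    """
    nbrs = neighbor_indices(idx, grid_h, grid_w, radius)
    q = feats[idx]      # query: feature of the masked patch
    kv = feats[nbrs]    # keys/values: features of the neighboring patches

    # (a) cosine similarity between the masked patch and each neighbor,
    #     then keep the top_k most similar neighbors (hypothetical rule).
    sims = (kv @ q) / (np.linalg.norm(kv, axis=1) * np.linalg.norm(q) + 1e-8)
    keep = np.argsort(-sims)[:top_k]
    sel = kv[keep]

    # (b) single-head scaled dot-product attention over the kept neighbors,
    #     without learned projections (a simplification of "cross-attention heads").
    logits = sel @ q / np.sqrt(feats.shape[1])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ sel, nbrs[keep], weights
```

For a 4x4 grid of 8-dimensional features, `neighbor_target(feats, 5, 4, 4)` returns an 8-dimensional target, the indices of the four neighbors it kept, and attention weights that sum to 1. A real model would presumably learn query, key, and value projections rather than attend over raw features.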