We learn visual features by captioning images with an image-conditioned masked diffusion language model, a formulation we call masked diffusion captioning (MDC). During training, text tokens in each image-caption pair are masked at a randomly chosen ratio, and a decoder conditioned on visual features is trained to reconstruct the original text. After training, the learned visual features can be applied to downstream vision tasks. Unlike autoregressive captioning, the strength of the visual learning signal in MDC does not depend on each token's position in the sequence, reducing the need for auxiliary objectives. Linear probing experiments across a variety of academic-scale models and datasets show that the learned visual features are competitive with those produced by autoregressive and contrastive approaches.
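The masking step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `mask_caption` and the `[MASK]` placeholder are assumptions, and a real model would operate on token IDs and condition a transformer decoder on image features rather than returning a target dictionary.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask symbol; real models use a special token ID

def mask_caption(tokens, rng):
    """Mask a randomly chosen fraction of caption tokens (illustrative sketch).

    MDC-style training samples a masking ratio at random, replaces that
    fraction of positions with a mask symbol, and trains an image-conditioned
    decoder to reconstruct the originals; the loss is computed only at the
    masked positions.
    """
    ratio = rng.random()                             # randomly chosen masking ratio
    n_mask = max(1, math.ceil(ratio * len(tokens)))  # mask at least one token
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}      # reconstruction targets
    return masked, targets
```

Because the ratio is resampled every step, every token position is masked (and hence supervised) with equal probability over training, which is the sense in which the visual learning signal does not depend on a token's position in the sequence.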