In this paper, we explore a new generative approach for learning visual representations. Our method, DARL, employs a decoder-only Transformer to predict image patches autoregressively. We find that training with a mean squared error (MSE) loss alone leads to strong representations. To enhance the image generation ability, we replace the MSE loss with a diffusion objective by using a denoising patch decoder. We show that the learned representation can be improved by using tailored noise schedules and by training larger models for longer. Notably, the optimal schedule differs significantly from those typically used in standard image diffusion models. Overall, despite its simple architecture, DARL delivers performance remarkably close to state-of-the-art masked prediction models under the fine-tuning protocol. This marks an important step towards a unified model capable of both visual perception and generation, effectively combining the strengths of autoregressive and denoising diffusion models.
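To make the training signal described in the abstract concrete, the following is a minimal sketch of autoregressive next-patch prediction with an MSE loss, plus the kind of Gaussian corruption a diffusion objective would apply to the target patches. It is an illustration under stated assumptions, not the DARL implementation: the single causal attention layer (no MLP, normalization, or positional encoding), the raster patch order, the function names (`patchify`, `predict_next`, `next_patch_mse`, `noise_target`), and the noise-level parameter `gamma` are all hypothetical choices made for this example.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into a raster-ordered sequence of flat patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return x.reshape(-1, patch * patch * c)

def init_params(patch_dim, width, rng):
    """Random weights for a toy one-layer model (assumed architecture)."""
    scale = 0.02
    return {
        "embed": scale * rng.standard_normal((patch_dim, width)),
        "wq": scale * rng.standard_normal((width, width)),
        "wk": scale * rng.standard_normal((width, width)),
        "wv": scale * rng.standard_normal((width, width)),
        "head": scale * rng.standard_normal((width, patch_dim)),
    }

def causal_self_attention(x, wq, wk, wv):
    """Single-head attention where position i attends only to positions <= i."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def predict_next(inputs, params):
    """Map patches 0..N-2 to predictions for patches 1..N-1."""
    h = inputs @ params["embed"]
    h = h + causal_self_attention(h, params["wq"], params["wk"], params["wv"])
    return h @ params["head"]

def next_patch_mse(patches, params):
    """MSE between each predicted next patch and the true next patch."""
    preds = predict_next(patches[:-1], params)
    return np.mean((preds - patches[1:]) ** 2)

def noise_target(target, gamma, rng):
    """Corrupt targets as sqrt(gamma) * x + sqrt(1 - gamma) * eps.

    gamma in [0, 1] is an assumed noise-level parameter; a denoising patch
    decoder would be trained to recover x (or eps) from this input.
    """
    eps = rng.standard_normal(target.shape)
    return np.sqrt(gamma) * target + np.sqrt(1.0 - gamma) * eps, eps

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8, 3))      # stand-in for a real image
patches = patchify(image, patch=4)          # 4 patches of dimension 48
params = init_params(patches.shape[1], 16, rng)
loss = next_patch_mse(patches, params)
noisy_targets, eps = noise_target(patches[1:], gamma=0.5, rng=rng)
```

The causal mask is what makes the model decoder-only in this sketch: the prediction for a given patch depends only on the patches before it. Swapping the MSE loss for a diffusion objective would, under these assumptions, mean feeding `noisy_targets` together with the Transformer output into a separate denoising decoder, with the distribution of `gamma` playing the role of the noise schedule.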