Although text-to-image (T2I) models have recently thrived as visual generative priors, their reliance on high-quality text-image pairs makes scaling up expensive. We argue that cross-modality alignment is not a necessity for a sound visual generative prior, whose focus should instead be texture modeling. This philosophy motivates us to study image-to-image (I2I) generation, where models can learn from in-the-wild images in a self-supervised manner. We first develop a purely vision-based training framework, Lumos, and confirm the feasibility and scalability of learning I2I models. We then find that, as an upstream task of T2I, our I2I model serves as a more foundational visual prior and achieves performance on par with or better than existing T2I models while using only 1/10 of the text-image pairs for fine-tuning. We further demonstrate the superiority of I2I priors over T2I priors on text-irrelevant visual generative tasks, such as image-to-3D and image-to-video.