Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. Here, the conventional approach is to learn the complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, this same mapping from noise to image is learnt whilst including a conditioning mechanism in the model. One key and thus far relatively unexplored property of flow matching is that, unlike diffusion models, it does not constrain the source distribution to be noise. Hence, in this paper, we propose a paradigm shift and ask whether we can instead train flow matching models to learn a direct mapping from the distribution of one modality to the distribution of another, thus obviating the need for both the noise distribution and the conditioning mechanism. We present CrossFlow, a general and simple framework for cross-modal flow matching. We show the importance of applying variational encoders to the input data, and introduce a method to enable classifier-free guidance. Surprisingly, for text-to-image generation, CrossFlow with a vanilla transformer, without cross-attention, slightly outperforms standard flow matching, and we show that it scales better with training steps and model size, while also allowing for interesting latent arithmetic that results in semantically meaningful edits in the output space. To demonstrate the generalizability of our approach, we also show that CrossFlow is on par with or outperforms the state of the art on various cross-modal / intra-modal mapping tasks, viz. image captioning, depth estimation, and image super-resolution. We hope this paper contributes to accelerating progress in cross-modal media generation.
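To make the training objective concrete, below is a minimal sketch of one cross-modal flow matching step, not the authors' implementation: it assumes PyTorch, and the names text_ve, velocity_model, and the KL weight are hypothetical placeholders. The source sample x0 is drawn from a variational encoding of the text rather than from Gaussian noise, and the model regresses the constant velocity along the straight path to the image latent x1, with no cross-attention conditioning.

import torch
import torch.nn.functional as F

def crossflow_training_step(text_ve, velocity_model, text_tokens, image_latents):
    """One cross-modal flow matching step mapping text latents to image latents.

    text_ve: variational encoder producing (mean, logvar) over a latent shaped
        like the image latents (hypothetical interface).
    velocity_model: vanilla transformer predicting the flow velocity
        (hypothetical interface).
    """
    # Encode text into the source sample x0; this replaces Gaussian noise
    # as the source distribution.
    mean, logvar = text_ve(text_tokens)
    x0 = mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)

    x1 = image_latents                              # target sample from the image distribution
    t = torch.rand(x1.shape[0], device=x1.device)   # uniform timestep per sample
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))        # broadcast t over latent dims

    # Linear interpolation path between source and target; along this straight
    # path the ground-truth velocity is the constant x1 - x0.
    xt = (1.0 - t_) * x0 + t_ * x1
    target_velocity = x1 - x0

    pred = velocity_model(xt, t)                    # no conditioning mechanism needed
    flow_loss = F.mse_loss(pred, target_velocity)

    # Standard KL regularizer for the variational encoder, keeping the text
    # latent close to a unit Gaussian; the 1e-3 weight is an assumed value.
    kl = -0.5 * torch.mean(1 + logvar - mean.pow(2) - logvar.exp())
    return flow_loss + 1e-3 * kl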