Flow matching has recently emerged as a promising alternative to diffusion-based generative models, particularly for text-to-image generation. Although it permits arbitrary source distributions, most existing approaches rely on a standard Gaussian, a choice inherited from diffusion models, and rarely treat the source distribution itself as an optimization target. In this work, we show that principled design of the source distribution is not only feasible but also beneficial at the scale of modern text-to-image systems. Specifically, we propose learning a condition-dependent source distribution under the flow matching objective that better exploits rich conditioning signals. We identify key failure modes that arise when conditioning is incorporated directly into the source, including distributional collapse and instability, and show that appropriate variance regularization and directional alignment between source and target are critical for stable and effective learning. We further analyze how the choice of target representation space affects flow matching with structured sources, revealing the regimes in which such designs are most effective. Extensive experiments across multiple text-to-image benchmarks demonstrate consistent and robust improvements, including up to 3x faster convergence in FID, highlighting the practical benefits of principled source distribution design for conditional flow matching.
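To make the ingredients named above concrete, the following is a minimal NumPy sketch of one possible reading: a Gaussian source whose mean and scale depend on the condition, a flow matching loss on a linear interpolation path, a variance penalty that discourages collapse, and a cosine-based alignment penalty between the source mean and the target. Every specific here is an assumption made for illustration and is not taken from the abstract: the linear heads `W_mu` and `W_logsig`, the linear path, the squared log-scale penalty, the cosine form of the alignment term, and the weights `lam_var` and `lam_dir`. The regularizers actually used in the paper may differ.

```python
import numpy as np

def sample_source(cond, W_mu, W_logsig, rng):
    """Draw x0 ~ N(mu(c), diag(sigma(c)^2)) using hypothetical linear heads."""
    mu = cond @ W_mu                 # condition-dependent mean
    log_sigma = cond @ W_logsig      # condition-dependent log standard deviation
    eps = rng.standard_normal(mu.shape)
    x0 = mu + np.exp(log_sigma) * eps  # reparameterized sample
    return x0, mu, log_sigma

def flow_matching_loss(x0, x1, t, cond, velocity_fn):
    """Squared error between predicted and target velocity on a linear path (assumed)."""
    t = t[:, None]
    x_t = (1.0 - t) * x0 + t * x1    # point on the straight line from x0 to x1
    target = x1 - x0                 # constant velocity along that line
    pred = velocity_fn(x_t, t, cond)
    return np.mean(np.sum((pred - target) ** 2, axis=1))

def variance_penalty(log_sigma):
    """Assumed regularizer: keep sigma near 1 so the source does not collapse."""
    return np.mean(log_sigma ** 2)

def alignment_penalty(mu, x1, eps=1e-8):
    """Assumed regularizer: 1 - cosine similarity between source mean and target."""
    num = np.sum(mu * x1, axis=1)
    den = np.linalg.norm(mu, axis=1) * np.linalg.norm(x1, axis=1) + eps
    return np.mean(1.0 - num / den)

def total_loss(cond, x1, t, W_mu, W_logsig, velocity_fn, rng,
               lam_var=0.1, lam_dir=0.1):
    """Combine the three terms; lam_var and lam_dir are hypothetical weights."""
    x0, mu, log_sigma = sample_source(cond, W_mu, W_logsig, rng)
    fm = flow_matching_loss(x0, x1, t, cond, velocity_fn)
    return fm + lam_var * variance_penalty(log_sigma) + lam_dir * alignment_penalty(mu, x1)
```

In this sketch the source parameters and the velocity model would be trained jointly by differentiating `total_loss` with an autodiff framework. The two penalties are intended to play the stabilizing roles the abstract attributes to variance regularization and directional alignment, without claiming to reproduce their exact form.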