We propose Terminal Velocity Matching (TVM), a generalization of flow matching that enables high-fidelity one- and few-step generative modeling. TVM models the transition between any two diffusion timesteps and regularizes its behavior at the terminal time rather than the initial time. We prove that TVM provides an upper bound on the $2$-Wasserstein distance between the data and model distributions when the model is Lipschitz continuous. However, since Diffusion Transformers lack this property, we introduce minimal architectural changes that achieve stable, single-stage training. To make TVM efficient in practice, we develop a fused attention kernel that supports backward passes on Jacobian-vector products and scales well with transformer architectures. On ImageNet-256x256, TVM achieves 3.29 FID with a single function evaluation (NFE) and 1.99 FID with 4 NFEs. On ImageNet-512x512, it similarly achieves 4.32 FID with 1 NFE and 2.94 FID with 4 NFEs, representing state-of-the-art performance for one- and few-step models trained from scratch.
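To make the Jacobian-vector-product requirement concrete, the sketch below is a minimal PyTorch illustration, not the paper's implementation: the toy network, tangent directions, and all names are assumptions. It computes a JVP of a time-conditioned network in a single forward-mode pass via `torch.func.jvp`; a TVM-style loss defined on this quantity must then be backpropagated through the JVP, which is the capability the paper's fused attention kernel provides for attention layers.

```python
import torch

# Toy stand-in for a time-conditioned velocity network u_theta(x, t).
# The architecture and all names here are illustrative assumptions,
# not the paper's Diffusion Transformer.
class ToyVelocityNet(torch.nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim + 1, 256),
            torch.nn.SiLU(),
            torch.nn.Linear(256, dim),
        )

    def forward(self, x, t):
        # Append the scalar time as one extra feature per sample.
        return self.net(torch.cat([x, t.expand(x.shape[0], 1)], dim=-1))

model = ToyVelocityNet()
x = torch.randn(8, 64)       # batch of states x_t
t = torch.tensor([[0.7]])    # an (illustrative) terminal time

# Directional derivative of the model output along a tangent (dx, dt),
# evaluated together with the output in one forward-mode pass. A training
# objective defined on jvp_out must be differentiated *through* the JVP,
# which is what the fused attention kernel supports at scale.
dx, dt = torch.randn_like(x), torch.ones_like(t)
out, jvp_out = torch.func.jvp(lambda x_, t_: model(x_, t_), (x, t), (dx, dt))
print(out.shape, jvp_out.shape)  # both: torch.Size([8, 64])
```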