Leveraging representation encoders for generative modeling offers a path to efficient, high-fidelity synthesis. However, standard diffusion transformers fail to converge on these representations directly. While recent work attributes this failure to a capacity bottleneck, proposing computationally expensive width scaling of diffusion transformers, we demonstrate that it is fundamentally geometric. We identify Geometric Interference as the root cause: standard Euclidean flow matching forces probability paths through the low-density interior of the hyperspherical feature space of representation encoders, rather than along the manifold surface. To resolve this, we propose Riemannian Flow Matching with Jacobi Regularization (RJF). By constraining the generative process to the manifold's geodesics and correcting for curvature-induced error propagation, RJF enables standard Diffusion Transformer architectures to converge without width scaling: the standard DiT-B architecture (131M parameters) achieves an FID of 3.37 where prior methods fail to converge. Code: https://github.com/amandpkr/RJF
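To illustrate the geometric point (this is an illustrative sketch, not the paper's implementation), the snippet below contrasts the linear interpolant used by Euclidean flow matching with a geodesic (slerp) interpolant on the unit hypersphere: the linear midpoint of two unit vectors falls into the sphere's interior, while the geodesic midpoint stays on the surface.

```python
import numpy as np

def slerp(x0, x1, t):
    """Geodesic (great-circle) interpolation between unit vectors x0 and x1."""
    x0 = x0 / np.linalg.norm(x0)
    x1 = x1 / np.linalg.norm(x1)
    omega = np.arccos(np.clip(np.dot(x0, x1), -1.0, 1.0))  # angle between points
    if omega < 1e-8:  # nearly identical points: geodesic degenerates to a point
        return x0
    return (np.sin((1 - t) * omega) * x0 + np.sin(t * omega) * x1) / np.sin(omega)

rng = np.random.default_rng(0)
x0 = rng.normal(size=8); x0 /= np.linalg.norm(x0)  # stand-in "noise" sample on the sphere
x1 = rng.normal(size=8); x1 /= np.linalg.norm(x1)  # stand-in "data" feature on the sphere

mid_euclid = 0.5 * x0 + 0.5 * x1  # linear path: cuts through the low-density interior
mid_geo = slerp(x0, x1, 0.5)      # geodesic path: remains on the manifold surface

print(np.linalg.norm(mid_euclid))  # < 1 for non-collinear points
print(np.linalg.norm(mid_geo))
```

The gap between the two norms is exactly the "interior shortcut" the abstract describes: a Euclidean probability path visits regions the encoder's hyperspherical features never occupy, whereas the geodesic path does not.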