Discrete diffusion or flow models could enable faster and more controllable sequence generation than autoregressive models. We show that na\"ive linear flow matching on the simplex is insufficient for this goal because it suffers from discontinuities in the training target and further pathologies. To overcome this, we develop Dirichlet flow matching on the simplex, which uses mixtures of Dirichlet distributions as probability paths. In this framework, we derive a connection between the mixtures' scores and the flow's vector field that allows for classifier and classifier-free guidance. Further, we provide distilled Dirichlet flow matching, which enables one-step sequence generation with minimal loss in performance, resulting in $O(L)$ speedups compared to autoregressive models. On complex DNA sequence generation tasks, we demonstrate superior performance compared to all baselines in distributional metrics and in achieving desired design targets for generated sequences. Finally, we show that our classifier-free guidance approach improves unconditional generation and is effective for generating DNA that satisfies design targets. Code is available at https://github.com/HannesStark/dirichlet-flow-matching.