Generative modeling over discrete data has recently seen numerous success stories, with applications spanning language modeling, biological sequence design, and graph-structured molecular data. The predominant generative modeling paradigm for discrete data is still autoregressive, while more recent alternatives based on diffusion or flow matching fall short of the impressive performance these methods achieve in continuous data settings, such as image or video generation. In this work, we introduce Fisher-Flow, a novel flow-matching model for discrete data. Fisher-Flow takes a manifestly geometric perspective by considering categorical distributions over discrete data as points residing on a statistical manifold equipped with its natural Riemannian metric: the $\textit{Fisher-Rao metric}$. As a result, we demonstrate that discrete data itself can be continuously reparameterised to points on the positive orthant of the $d$-hypersphere $\mathbb{S}^d_+$, which allows us to define flows that map any source distribution to a target distribution in a principled manner by transporting mass along (closed-form) geodesics of $\mathbb{S}^d_+$. Furthermore, the learned flows in Fisher-Flow can be bootstrapped by leveraging Riemannian optimal transport, leading to improved training dynamics. We prove that the gradient flow induced by Fisher-Flow is optimal in reducing the forward KL divergence. We evaluate Fisher-Flow on an array of synthetic and diverse real-world benchmarks, including the design of DNA promoter and DNA enhancer sequences. Empirically, we find that Fisher-Flow improves over prior diffusion and flow-matching models on these benchmarks.
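To make the geometric picture concrete, the following is a minimal NumPy sketch of the reparameterisation and the closed-form geodesic, not the authors' implementation. It assumes the standard square-root map $p \mapsto \sqrt{p}$, which sends a categorical distribution over $d+1$ categories to a point on $\mathbb{S}^d_+$, and uses the great-circle (spherical linear interpolation) formula for geodesics on the unit sphere. The function names `to_sphere`, `to_simplex`, and `geodesic` are hypothetical and chosen only for illustration.

```python
import numpy as np

def to_sphere(p):
    """Map a categorical distribution p to the positive orthant of the unit sphere.

    Assumption: the square-root map x_i = sqrt(p_i), so that sum(x_i^2) = sum(p_i) = 1.
    """
    return np.sqrt(np.asarray(p, dtype=float))

def to_simplex(x):
    """Inverse map: square each coordinate to recover a categorical distribution."""
    return np.asarray(x, dtype=float) ** 2

def geodesic(x0, x1, t):
    """Point at time t in [0, 1] on the great-circle arc from x0 to x1 (both unit vectors)."""
    cos_theta = np.clip(np.dot(x0, x1), -1.0, 1.0)
    theta = np.arccos(cos_theta)  # angle between the two points
    if theta < 1e-8:              # endpoints (numerically) coincide
        return x0.copy()
    return (np.sin((1.0 - t) * theta) * x0 + np.sin(t * theta) * x1) / np.sin(theta)

# Illustrative endpoints: a uniform source over 4 categories and a one-hot target.
p0 = np.array([0.25, 0.25, 0.25, 0.25])
p1 = np.array([1.0, 0.0, 0.0, 0.0])

x0, x1 = to_sphere(p0), to_sphere(p1)
x_mid = geodesic(x0, x1, 0.5)   # midpoint on the sphere
p_mid = to_simplex(x_mid)       # corresponding categorical distribution
```

Because both endpoints lie in the positive orthant, the angle between them is at most $\pi/2$ and both interpolation weights are non-negative, so every intermediate point stays on $\mathbb{S}^d_+$ and squares back to a valid categorical distribution. In a flow-matching setup one would regress a vector field onto the time derivative of such interpolants; that training step is outside the scope of this sketch.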