Diffusion and flow matching approaches to generative modeling have shown promise in domains where the state space is continuous, such as image generation or protein folding and design, and in domains where it is discrete, as exemplified by diffusion large language models. They are a natural fit when the number of elements in a state is fixed in advance (e.g., images), but require ad hoc solutions when that number is not known a priori, for example the length of a response from a large language model or the number of amino acids in a protein chain. Here we propose Branching Flows, a generative modeling framework that, like diffusion and flow matching approaches, transports a simple distribution to the data distribution. In Branching Flows, however, the elements of the state evolve over a forest of binary trees, branching and dying stochastically with rates that are learned by the model. This allows the model to control, during generation, the number of elements in the sequence. We also show that Branching Flows can compose with any flow matching base process on discrete sets, continuous Euclidean spaces, smooth manifolds, and "multimodal" product spaces that mix these components. We demonstrate this in three domains: small molecule generation (multimodal), antibody sequence generation (discrete), and protein backbone generation (multimodal). The results show that Branching Flows is a capable distribution learner with a stable learning objective, and that it enables new capabilities.
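To make the branching-and-dying mechanism concrete, the following is a minimal sketch of a stochastic process in which each element can split in two or be deleted over t in [0, 1]. It is not the authors' implementation: the function name `simulate_branching`, the constant `branch_rate` and `death_rate` (which stand in for the rates a model would learn), and the Euler time discretization are all assumptions made for illustration, and the base flow that moves each element's state is omitted.

```python
import random


def simulate_branching(n_init, branch_rate, death_rate, n_steps=100, seed=0):
    """Euler-discretized branching/death process on t in [0, 1].

    Hypothetical illustration: the rates are constants here, whereas in
    Branching Flows they are learned by the model. Returns the element
    count after each step, starting with the initial count.
    """
    rng = random.Random(seed)
    dt = 1.0 / n_steps
    # Per-step event probabilities, clipped to be valid probabilities.
    p_death = min(1.0, death_rate * dt)
    p_branch = min(1.0, branch_rate * dt)

    # Each entry stands in for one element's state (here just an id).
    elements = list(range(n_init))
    counts = [len(elements)]
    for _ in range(n_steps):
        next_elements = []
        for x in elements:
            u = rng.random()
            if u < p_death:
                continue  # the element dies
            if u < p_death + p_branch:
                next_elements.extend([x, x])  # binary split into two children
            else:
                next_elements.append(x)  # the element survives unchanged
        elements = next_elements
        counts.append(len(elements))
    return counts


if __name__ == "__main__":
    counts = simulate_branching(n_init=4, branch_rate=1.0, death_rate=0.5)
    print("initial count:", counts[0], "final count:", counts[-1])
```

Because every split is binary, the descendants of each initial element form a binary tree, and the initial elements together form a forest. In this sketch the final sequence length is determined by the rates rather than fixed in advance, which is the property the abstract highlights.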