We introduce Discrete flow Matching policy Optimization (DoMinO), a unified framework for Reinforcement Learning (RL) fine-tuning of Discrete Flow Matching (DFM) models under a broad class of policy gradient methods. Our key idea is to view the DFM sampling procedure as a multi-step Markov Decision Process. This perspective yields a simple and transparent reformulation of reward maximization during fine-tuning as a robust RL objective. Consequently, it not only preserves the original DFM samplers but also avoids the biased auxiliary estimators and likelihood surrogates used by many prior RL fine-tuning methods. To prevent policy collapse, we also introduce new total-variation regularizers that keep the fine-tuned distribution close to the pretrained one. Theoretically, we establish an upper bound on the discretization error of DoMinO and tractable upper bounds for the regularizers. Experimentally, we evaluate DoMinO on regulatory DNA sequence design, where it achieves stronger predicted enhancer activity and better sequence naturalness than the previous best reward-driven baselines. The regularization further improves alignment with the natural sequence distribution while preserving strong functional performance. These results establish DoMinO as a useful framework for controllable discrete sequence generation.
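To make the core idea concrete, the following is a minimal, self-contained sketch (not the paper's implementation) of how one might treat each step of a multi-step discrete sampler as an MDP transition, apply a REINFORCE-style policy-gradient update on a terminal sequence reward, and add a total-variation penalty against a frozen copy standing in for the pretrained model. All names and sizes here (TinyDFMPolicy, reward_fn, NUM_STEPS, the all-mask initial state, the simplistic state update) are illustrative assumptions, not the actual DoMinO architecture or sampler.

```python
# Minimal sketch: DFM-style sampling viewed as a multi-step MDP, with a
# policy-gradient loss plus a total-variation regularizer toward a frozen
# "pretrained" policy. Everything here is a toy stand-in for illustration.
import torch
import torch.nn as nn

VOCAB, LENGTH, NUM_STEPS = 4, 8, 10  # e.g. DNA alphabet, sequence length, sampling steps

class TinyDFMPolicy(nn.Module):
    """Per-step transition logits over tokens, conditioned on current sequence and step index."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB + 1, 16)    # +1 for a mask token
        self.step_emb = nn.Embedding(NUM_STEPS, 16)
        self.head = nn.Linear(16, VOCAB)

    def forward(self, x, t):
        h = self.emb(x) + self.step_emb(t)[:, None, :]
        return self.head(h)                       # (batch, LENGTH, VOCAB) logits

def rollout(policy, batch=32):
    """Run the multi-step sampler, recording per-step log-probs: one MDP trajectory per sequence."""
    x = torch.full((batch, LENGTH), VOCAB)         # start from an all-mask state
    logps, states, steps = [], [], []
    for t in range(NUM_STEPS):
        tt = torch.full((batch,), t)
        dist = torch.distributions.Categorical(logits=policy(x, tt))
        a = dist.sample()                          # proposed tokens at every position
        logps.append(dist.log_prob(a).sum(-1))     # joint log-prob of this step's action
        states.append(x.clone()); steps.append(tt)
        x = a                                      # simplistic state update, for the sketch only
    return x, torch.stack(logps), states, steps

def reward_fn(seqs):
    """Placeholder terminal reward; in the real task this would be, e.g., predicted enhancer activity."""
    return (seqs == 0).float().mean(-1)

policy, ref = TinyDFMPolicy(), TinyDFMPolicy()
ref.load_state_dict(policy.state_dict())           # frozen copy stands in for the pretrained model
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

seqs, logps, states, steps = rollout(policy)
reward = reward_fn(seqs)
pg_loss = -(reward.detach() * logps.sum(0)).mean() # REINFORCE over the whole trajectory

# TV-style regularizer: 0.5 * sum_a |p_theta(a|s,t) - p_ref(a|s,t)| at visited states.
with torch.no_grad():
    ref_probs = [ref(s, t).softmax(-1) for s, t in zip(states, steps)]
tv = torch.tensor(0.0)
for (s, t), pr in zip(zip(states, steps), ref_probs):
    tv = tv + 0.5 * (policy(s, t).softmax(-1) - pr).abs().sum(-1).mean()

loss = pg_loss + 0.1 * tv / NUM_STEPS
opt.zero_grad(); loss.backward(); opt.step()
```

The sketch only illustrates the MDP framing: per-step transitions supply the states and actions, the trajectory log-probability carries the policy gradient, and the TV term penalizes drift from the reference policy at the states actually visited during sampling.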