Controlling generative models is computationally expensive. This is because optimal alignment with a reward function, whether via inference-time steering or fine-tuning, requires estimating the value function. That task demands access to the conditional posterior $p_{1|t}(x_1|x_t)$, the distribution of clean data $x_1$ consistent with an intermediate state $x_t$, a requirement that typically forces methods to resort to costly trajectory simulations. To address this bottleneck, we introduce Meta Flow Maps (MFMs), a framework that extends consistency models and flow maps into the stochastic regime. MFMs are trained to perform stochastic one-step posterior sampling, generating arbitrarily many i.i.d. draws of clean data $x_1$ from any intermediate state. Crucially, these samples provide a differentiable reparametrization that unlocks efficient value-function estimation. We leverage this capability to resolve bottlenecks in both paradigms: enabling inference-time steering without inner rollouts, and facilitating unbiased, off-policy fine-tuning to general rewards. Empirically, our single-particle steered-MFM sampler outperforms a Best-of-1000 baseline on ImageNet across multiple rewards at a fraction of the compute.
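As a concrete reading of the value-function claim, the following is a minimal sketch; the sampler $F_\theta$, the noise variable $\xi$, and the Gaussian choice are illustrative assumptions, not notation from the paper. If a trained MFM maps an intermediate state and fresh noise to a one-step posterior draw, $x_1^{(i)} = F_\theta(x_t, t, \xi^{(i)})$ with $\xi^{(i)} \sim \mathcal{N}(0, I)$ i.i.d., then the value function of a reward $r$ admits the simulation-free Monte Carlo estimate
$$
V_t(x_t) = \mathbb{E}_{x_1 \sim p_{1|t}(x_1|x_t)}\!\left[r(x_1)\right] \approx \frac{1}{N} \sum_{i=1}^{N} r\!\left(F_\theta(x_t, t, \xi^{(i)})\right),
$$
and because the draws are reparametrized through $F_\theta$, the estimate is differentiable in $x_t$, so a steering gradient $\nabla_{x_t} V_t(x_t)$ costs $N$ forward passes rather than inner trajectory rollouts.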