Flow and diffusion models produce high-quality samples, but adapting them to user preferences or constraints after training remains costly and brittle, a challenge commonly called reward alignment. We argue that efficient reward alignment should be a property of the generative model itself, not an afterthought, and we redesign the model for adaptability. We propose "Diamond Maps", stochastic flow map models that enable efficient and accurate alignment to arbitrary rewards at inference time. Like flow maps, Diamond Maps amortize many simulation steps into a single-step sampler, but they preserve the stochasticity required for optimal reward alignment. This design makes search, sequential Monte Carlo, and guidance scalable by enabling efficient and consistent estimation of the value function. Our experiments show that Diamond Maps can be learned efficiently via distillation from GLASS Flows, achieve stronger reward-alignment performance, and scale better than existing methods. Our results point toward a practical route to generative models that can be rapidly adapted to arbitrary preferences and constraints at inference time.
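To make the value-estimation claim concrete, the sketch below is a hypothetical illustration, not the paper's implementation: `flow_map`, `reward`, and all shapes are stand-ins. It shows why the preserved stochasticity matters. A stochastic one-step sampler yields a consistent Monte Carlo estimate of the value function V(x_t) = E[reward(x_1) | x_t], whereas a deterministic flow map collapses the average to a single point.

```python
import torch

# Hypothetical sketch: a stochastic one-step flow map lets us estimate the
# value function V(x_t) = E[reward(x_1) | x_t] by plain Monte Carlo, since
# each fresh noise draw yields an independent endpoint sample in one step.
def mc_value(flow_map, reward, x_t, t, n=16):
    # V(x_t) ≈ (1/n) Σ_k reward(flow_map(x_t, t, ε_k)),  ε_k ~ N(0, I)
    endpoints = [flow_map(x_t, t, torch.randn_like(x_t)) for _ in range(n)]
    return torch.stack([reward(x1) for x1 in endpoints]).mean(dim=0)

# Toy stand-ins (not the paper's model): a linear map and a quadratic reward.
flow_map = lambda x_t, t, eps: x_t + (1.0 - t) * eps
reward = lambda x1: -(x1 ** 2).sum(dim=-1)

x_t = torch.zeros(4, 2)                 # batch of 4 intermediate states
v = mc_value(flow_map, reward, x_t, t=0.5, n=64)
print(v)                                # one value estimate per particle
```

In a sequential Monte Carlo loop, such per-particle estimates would typically set the resampling weights, e.g. proportional to exp(λ·V(x_t)), which is the sense in which consistent and cheap value estimates make search, SMC, and guidance scale.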