Inference-time reward alignment asks how to turn a pre-trained diffusion model with base law $p$ into a sampler that favors a reward $r$ while remaining close to $p$. Since there is no canonical distributional distance for this closeness constraint, different choices lead to different "reward-aligned" laws and, just as importantly, to different algorithmic problems. We develop a primitive-based approach to reward alignment: rather than assuming that arbitrary reward-aligned laws can be sampled, we ask which simple algorithmic primitives suffice to implement alignment for non-trivial reward classes. If closeness is measured in KL divergence, the target law is $q(x) \propto p(x) \exp(λ^{-1}r(x))$. For this setting, we show that linear exponential tilts of the form $q(x)\propto p(x)\exp(\langle θ, x \rangle)$, which according to recent work [MRR26] can be sampled efficiently, are a sufficient primitive for aligning to a broad class of convex low-dimensional rewards. If closeness is measured in Wasserstein distance with transport cost $c$, the corresponding primitive is a proximal transport oracle: given $x$, solve $\operatorname{argmax}_y \{r(y)- λc(x,y)\}$. This oracle can be implemented efficiently for concave rewards or for low-dimensional Lipschitz rewards $r(x)=f(Ax)$. Together, these results illustrate that the choice of distributional distance for alignment determines both the computational primitive and the tractable reward class.
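To make the two primitives concrete, the following is a minimal toy sketch and not the paper's algorithm. It assumes a finite support for the KL case, reads the KL-aligned law as the maximizer of $\mathbb{E}_q[r] - λ\,\mathrm{KL}(q\,\|\,p)$, and assumes a quadratic cost $c(x,y)=\tfrac12\|x-y\|^2$ with a linear reward $r(y)=\langle a, y\rangle$ for the proximal transport oracle. The function names `kl_tilt`, `objective`, and `prox_transport_linear` are hypothetical and chosen only for illustration.

```python
import math

def kl_tilt(p, r, lam):
    """Return q with q[i] proportional to p[i] * exp(r[i] / lam) on a finite support."""
    # Subtract the largest exponent before exponentiating, for numerical stability.
    m = max(ri / lam for ri in r)
    w = [pi * math.exp(ri / lam - m) for pi, ri in zip(p, r)]
    z = sum(w)
    return [wi / z for wi in w]

def kl(q, p):
    """KL(q || p) for discrete distributions on the same support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def objective(q, p, r, lam):
    """Assumed alignment objective: E_q[r] - lam * KL(q || p)."""
    return sum(qi * ri for qi, ri in zip(q, r)) - lam * kl(q, p)

def prox_transport_linear(x, a, lam):
    """Closed-form argmax_y <a, y> - lam * 0.5 * ||x - y||^2 (assumed quadratic cost).

    Setting the gradient a - lam * (y - x) to zero gives y = x + a / lam.
    """
    return [xi + ai / lam for xi, ai in zip(x, a)]

# Toy base law on three points, with an illustrative reward and temperature.
p = [0.5, 0.3, 0.2]
r = [0.0, 1.0, 2.0]
lam = 1.0

q = kl_tilt(p, r, lam)
print("tilted law:", q)
print("objective at q:", objective(q, p, r, lam))
print("objective at p:", objective(p, p, r, lam))
print("prox step:", prox_transport_linear([1.0, 2.0], [2.0, -4.0], 2.0))
```

With support points $x_i$ and $r_i = \langle θ, x_i\rangle$, $λ = 1$, the same `kl_tilt` call is a discrete instance of the linear exponential tilt. In the continuous diffusion setting neither the normalizer nor the argmax is generally available in closed form, which is why the abstract treats them as oracles.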