A simple and effective method for the inference-time alignment of generative models is the best-of-$n$ policy, where $n$ samples are drawn from a reference policy, ranked according to a reward function, and the highest-ranking one is selected. A commonly used analytical expression in the literature claims that the KL divergence between the best-of-$n$ policy and the reference policy is equal to $\log(n) - (n-1)/n$. We disprove this claim and show that the expression is instead an upper bound on the actual KL divergence. We explore the tightness of this upper bound in different regimes, propose a new estimator for the KL divergence, and empirically show that it provides a tight approximation. We also show that the win rate of the best-of-$n$ policy against the reference policy is upper bounded by $n/(n+1)$, and derive bounds on the tightness of this characterization. We conclude by analyzing the tradeoffs between the win rate and the KL divergence of the best-of-$n$ alignment policy, demonstrating that very good tradeoffs are achievable with $n < 1000$.
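As a minimal illustration (not from the paper), the following Python sketch evaluates the quantity in question on a toy discrete reference policy with distinct rewards. In that setting the best-of-$n$ distribution has the closed form $F(i)^n - F(i-1)^n$, where $F$ is the CDF over outcomes sorted by reward, so the exact KL divergence can be computed and compared against $\log(n) - (n-1)/n$. The reference probabilities and reward values below are hypothetical.

```python
import numpy as np

# Toy sketch (hypothetical values, not from the paper): a small discrete
# reference policy with distinct rewards, so the best-of-n distribution
# has a closed form.
p_ref = np.array([0.5, 0.3, 0.15, 0.05])   # reference policy (hypothetical)
rewards = np.array([0.1, 0.4, 0.7, 1.0])   # distinct rewards (hypothetical)

order = np.argsort(rewards)                 # sort outcomes by reward
p_sorted = p_ref[order]
cdf = np.cumsum(p_sorted)                   # reward-ordered CDF F

for n in [1, 2, 4, 16, 64]:
    # P(best-of-n selects the i-th lowest-reward outcome) = F(i)^n - F(i-1)^n
    p_bon = cdf**n - np.concatenate(([0.0], cdf[:-1]))**n
    kl = np.sum(p_bon * np.log(p_bon / p_sorted))   # exact KL(best-of-n || ref)
    expr = np.log(n) - (n - 1) / n                  # the analytical expression
    print(f"n={n:3d}  KL={kl:.4f}  log(n)-(n-1)/n={expr:.4f}")
```

On this toy example the exact KL divergence stays below $\log(n) - (n-1)/n$ for every $n > 1$, consistent with the upper-bound result stated above.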