A widely used technique for improving policies is success conditioning, in which one collects trajectories, identifies those that achieve a desired outcome, and updates the policy to imitate the actions taken along successful trajectories. This principle appears under many names -- rejection sampling with SFT, goal-conditioned RL, Decision Transformers -- yet what optimization problem it solves, if any, has remained unclear. We prove that success conditioning exactly solves a trust-region optimization problem, maximizing policy improvement subject to a $\chi^2$ divergence constraint whose radius is determined automatically by the data. This yields an identity: relative policy improvement, the magnitude of policy change, and a quantity we call action-influence -- measuring how random variation in action choices affects success rates -- are exactly equal at every state. Success conditioning thus emerges as a conservative improvement operator. Exact success conditioning cannot degrade performance or induce dangerous distribution shift, but when it fails, it does so observably, by hardly changing the policy at all. We apply our theory to the common practice of return thresholding, showing this can amplify improvement, but at the cost of potential misalignment with the true objective.
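As a minimal illustration of the identity stated in the abstract, the sketch below works through a single-state, one-step problem with a binary success outcome. Everything in it is an assumption made for illustration rather than the paper's own construction: the names `success_condition` and `chi_square`, the toy numbers, and the one-step forms used for the three quantities, namely relative improvement $(V^+ - V)/V$, policy change $\chi^2(\pi^+ \,\|\, \pi)$, and action-influence taken as $\mathrm{Var}_\pi(Q)/V^2$. The paper's general multi-step definitions may differ.

```python
# Toy one-step example (hypothetical numbers): three actions, binary success.
# Exact success conditioning reweights the policy by success probability:
#     pi_plus(a) = pi(a) * Q(a) / V,   where V = sum_a pi(a) * Q(a).


def success_condition(policy, success_prob):
    """Return the success-conditioned policy and the original success rate V."""
    value = sum(p * q for p, q in zip(policy, success_prob))
    new_policy = [p * q / value for p, q in zip(policy, success_prob)]
    return new_policy, value


def chi_square(new_policy, old_policy):
    """Chi-square divergence of new_policy from old_policy (old_policy > 0)."""
    return sum((n - o) ** 2 / o for n, o in zip(new_policy, old_policy))


policy = [0.5, 0.3, 0.2]        # assumed behavior policy pi(a)
success_prob = [0.2, 0.5, 0.8]  # assumed success probability Q(a) per action

new_policy, value = success_condition(policy, success_prob)
new_value = sum(p * q for p, q in zip(new_policy, success_prob))

# Three quantities that coincide in this one-step setting.
relative_improvement = (new_value - value) / value
policy_change = chi_square(new_policy, policy)
variance = sum(p * (q - value) ** 2 for p, q in zip(policy, success_prob))
action_influence = variance / value ** 2  # assumed one-step form

print(relative_improvement, policy_change, action_influence)
```

In this toy setting the three printed values agree up to floating-point error. If every action had the same success probability, the variance term would be zero and the conditioned policy would equal the original one, which matches the abstract's remark that failure shows up as a policy that hardly changes.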