A widely used technique for improving policies is success conditioning, in which one collects trajectories, identifies those that achieve a desired outcome, and updates the policy to imitate the actions taken along successful trajectories. This principle appears under many names -- rejection sampling with SFT, goal-conditioned RL, Decision Transformers -- yet what optimization problem it solves, if any, has remained unclear. We prove that success conditioning exactly solves a trust-region optimization problem, maximizing policy improvement subject to a $\chi^2$ divergence constraint whose radius is determined automatically by the data. This yields an identity: relative policy improvement, the magnitude of policy change, and a quantity we call action-influence -- measuring how random variation in action choices affects success rates -- are exactly equal at every state. Success conditioning thus emerges as a conservative improvement operator. Exact success conditioning cannot degrade performance or induce dangerous distribution shift, but when it fails, it does so observably, by hardly changing the policy at all. We apply our theory to the common practice of return thresholding, showing that it can amplify improvement, but at the cost of potential misalignment with the true objective.
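To fix ideas, here is a minimal sketch of the exact operator in illustrative notation (the symbols below are not taken from the paper): $\pi$ denotes the current policy, $S$ the event that a trajectory achieves the desired outcome, and $\pi_{\mathrm{sc}}$ the policy obtained in the limit of imitating actions drawn from successful trajectories, with success probabilities evaluated under $\pi$. By Bayes' rule, conditioning on success reweights each action by its relative success probability:
\[
  \pi_{\mathrm{sc}}(a \mid s)
  \;=\; \pi(a \mid s, S)
  \;=\; \pi(a \mid s)\,\frac{\Pr(S \mid s, a)}{\Pr(S \mid s)},
  \qquad
  \Pr(S \mid s) \;=\; \sum_{a} \pi(a \mid s)\,\Pr(S \mid s, a).
\]
Read this way, the conservatism described above is visible in the reweighting factor: at states where the choice of action barely moves the success probability, the factor stays near $1$ and the policy is left almost unchanged, which is exactly the observable failure mode described above.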