Standard trust-region methods constrain policy updates via the Kullback-Leibler (KL) divergence. However, KL controls only an average divergence and does not directly prevent rare, large likelihood-ratio excursions that destabilize training, which is precisely the failure mode that motivates heuristics such as PPO's clipping. We propose overlap geometry as an alternative trust region, constraining distributional overlap via the Bhattacharyya coefficient (closely related to the Hellinger and Rényi-1/2 geometry). This objective penalizes separation in the ratio tails, yielding tighter control over likelihood-ratio excursions without relying on total variation bounds that can be loose in tail regimes. We derive Bhattacharyya-TRPO (BTRPO) and Bhattacharyya-PPO (BPPO), which enforce overlap constraints via square-root ratio updates: BPPO clips the square-root ratio q = sqrt(r), where r is the likelihood ratio, and BTRPO applies a quadratic Hellinger/Bhattacharyya penalty. Empirically, overlap-based updates improve robustness and aggregate performance as measured by RLiable under matched training budgets, suggesting overlap constraints as a practical, principled alternative to KL for stable policy optimization.
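To make the square-root ratio idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not give the exact BPPO or BTRPO objectives, so the two surrogate functions are assumptions: the first reuses PPO's pessimistic minimum with q = sqrt(r) in place of r, and the second subtracts a quadratic penalty (q - 1)^2, whose expectation under the old policy equals 2(1 - BC) for normalized distributions. All function names and the parameters `eps` and `beta` are hypothetical.

```python
import numpy as np


def bhattacharyya_coefficient(p_old, p_new):
    """Overlap BC = sum_a sqrt(p_old(a) * p_new(a)) for discrete distributions."""
    p_old = np.asarray(p_old, dtype=float)
    p_new = np.asarray(p_new, dtype=float)
    return float(np.sum(np.sqrt(p_old * p_new)))


def sqrt_ratio(logp_new, logp_old):
    """q = sqrt(r) = exp(0.5 * (log pi_new - log pi_old)), computed in log space."""
    diff = np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float)
    return np.exp(0.5 * diff)


def clipped_sqrt_ratio_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Assumed BPPO-style objective: PPO's pessimistic min with q in place of r."""
    q = sqrt_ratio(logp_new, logp_old)
    adv = np.asarray(advantages, dtype=float)
    unclipped = q * adv
    clipped = np.clip(q, 1.0 - eps, 1.0 + eps) * adv
    return float(np.mean(np.minimum(unclipped, clipped)))


def hellinger_penalized_surrogate(logp_new, logp_old, advantages, beta=1.0):
    """Assumed BTRPO-style objective: mean(r * A) minus beta * mean((q - 1)^2)."""
    q = sqrt_ratio(logp_new, logp_old)
    adv = np.asarray(advantages, dtype=float)
    r = q ** 2  # recover the likelihood ratio from its square root
    penalty = np.mean((q - 1.0) ** 2)  # sample estimate of 2 * (1 - BC)
    return float(np.mean(r * adv) - beta * penalty)


if __name__ == "__main__":
    # Illustrative three-action policies (made-up numbers).
    p_old = np.array([0.5, 0.3, 0.2])
    p_new = np.array([0.4, 0.4, 0.2])
    bc = bhattacharyya_coefficient(p_old, p_new)
    print("Bhattacharyya coefficient:", bc)
    print("squared Hellinger distance (1 - BC):", 1.0 - bc)
```

Because q is the square root of r, clipping q to [1 - eps, 1 + eps] corresponds to keeping r within [(1 - eps)^2, (1 + eps)^2], so an `eps` chosen for q is not directly comparable to PPO's usual clipping range on r.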