Standard policy gradients weight each sampled action by advantage alone, regardless of how likely that action was under the current policy. This creates two pathologies: within a single decision context (e.g. one image or prompt), a rare negative-advantage action can disproportionately distort the update direction; across many such contexts in a batch, the expected gradient over-allocates budget to contexts the policy already handles well. We introduce the \textit{Delightful Policy Gradient} (DG), which gates each term with a sigmoid of \emph{delight}, the product of advantage and action surprisal (negative log-probability). For $K$-armed bandits, DG provably improves directional accuracy in a single context and, across multiple contexts, shifts the expected gradient strictly closer to the supervised cross-entropy oracle. This second effect is not variance reduction: it persists even with infinite samples. Empirically, DG outperforms REINFORCE, PPO, and advantage-weighted baselines across MNIST, transformer sequence modeling, and continuous control, with larger gains on harder tasks.
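To make the gating rule concrete, here is a minimal single-sample sketch for a $K$-armed bandit. The function name `dg_gradient` and the softmax-logits parameterization are illustrative assumptions, not the paper's reference implementation; the gate follows the abstract's definition, $\sigma(\text{delight})$ with $\text{delight} = A \cdot (-\log \pi(a))$.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def dg_gradient(logits, action, advantage):
    """One-sample Delightful Policy Gradient for a K-armed bandit (sketch).

    Gates the REINFORCE term  advantage * grad log pi(action)  by
    sigmoid(delight), where delight = advantage * surprisal and
    surprisal = -log pi(action) under the current policy.
    """
    probs = softmax(logits)
    surprisal = -np.log(probs[action])
    delight = advantage * surprisal
    gate = 1.0 / (1.0 + np.exp(-delight))   # sigmoid gate in (0, 1)
    grad_logp = -probs                       # d log pi(action) / d logits
    grad_logp[action] += 1.0                 # ... for a softmax policy
    return gate * advantage * grad_logp
```

Note how the gate reproduces the abstract's motivation: a rare action (high surprisal) with negative advantage yields a large negative delight, so the gate approaches zero and the term is damped, whereas REINFORCE would apply it at full weight.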