We derive a policy gradient theorem for Cumulative Prospect Theory (CPT) objectives in finite-horizon Reinforcement Learning (RL), generalizing the standard policy gradient theorem and encompassing distortion-based risk objectives as special cases. Motivated by behavioral economics, CPT combines an asymmetric utility transformation around a reference point with probability distortion. Building on our theorem, we design a first-order policy gradient algorithm for CPT-RL using a Monte Carlo gradient estimator based on order statistics. We establish statistical guarantees for the estimator and prove asymptotic convergence of the resulting algorithm to first-order stationary points of the (generally non-convex) CPT objective. Simulations illustrate qualitative behaviors induced by CPT and compare our first-order approach to existing zeroth-order methods.
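For concreteness, the CPT value of a random return $X$ is commonly written in the following standard form, which we sketch here for illustration; the notation (utilities $u^+,u^-$ vanishing below and above the reference point, probability distortions $w^+,w^-$, sample size $n$) is assumed rather than taken from the paper itself:
\[
  C(X) \;=\; \int_0^{\infty} w^+\!\big(\mathbb{P}\big(u^+(X) > z\big)\big)\, dz
  \;-\; \int_0^{\infty} w^-\!\big(\mathbb{P}\big(u^-(X) > z\big)\big)\, dz .
\]
Given $n$ i.i.d. sampled returns with order statistics $X_{(1)} \le \dots \le X_{(n)}$, the usual quantile-based Monte Carlo estimate then takes the form
\[
  \widehat{C}_n \;=\; \sum_{i=1}^{n} u^+\!\big(X_{(i)}\big)
    \Big[ w^+\!\Big(\tfrac{n+1-i}{n}\Big) - w^+\!\Big(\tfrac{n-i}{n}\Big) \Big]
  \;-\; \sum_{i=1}^{n} u^-\!\big(X_{(i)}\big)
    \Big[ w^-\!\Big(\tfrac{i}{n}\Big) - w^-\!\Big(\tfrac{i-1}{n}\Big) \Big],
\]
which reduces to the ordinary sample mean when $u^\pm$ are the identity on their respective domains and $w^\pm$ are the identity distortion.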