Policies learned via continuous actor-critic methods often exhibit erratic, high-frequency oscillations, making them unsuitable for physical deployment. Current approaches attempt to enforce smoothness by directly regularizing the policy's output. We argue that this approach treats the symptom rather than the cause. In this work, we theoretically establish that policy non-smoothness is fundamentally governed by the differential geometry of the critic. By applying implicit differentiation to the actor-critic objective, we prove that the sensitivity of the optimal policy is bounded by the ratio of the Q-function's mixed-partial derivative (noise sensitivity) to its action-space curvature (signal distinctness). To empirically validate this theoretical insight, we introduce PAVE (Policy-Aware Value-field Equalization), a critic-centric regularization framework that treats the critic as a scalar field and stabilizes its induced action-gradient field. PAVE rectifies the learning signal by minimizing the Q-gradient volatility while preserving local curvature. Experimental results demonstrate that PAVE achieves smoothness and robustness comparable to policy-side smoothness regularization methods, while maintaining competitive task performance, without modifying the actor.
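The bound stated above follows from implicit differentiation of the greedy policy's first-order condition. As a hedged sketch in our own notation (assuming Q is twice differentiable and strictly concave in the action near the optimum; the paper's exact statement may differ):

```latex
\nabla_a Q\big(s, \pi^*(s)\big) = 0
\quad\Longrightarrow\quad
\nabla_a^2 Q \,\frac{d\pi^*}{ds} + \nabla_s\nabla_a Q = 0
\quad\Longrightarrow\quad
\left\|\frac{d\pi^*}{ds}\right\|
  \;\le\; \frac{\big\|\nabla_s\nabla_a Q\big\|}{\lambda_{\min}\!\big({-\nabla_a^2 Q}\big)}
```

The numerator is the mixed partial (how much the action-gradient field shifts with the state, i.e. noise sensitivity) and the denominator is the action-space curvature (signal distinctness), matching the ratio described in the abstract.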
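A minimal sketch of the kind of Q-gradient volatility penalty the abstract describes, using a toy quadratic critic. Everything here is illustrative: `q_fn`, `sigma`, and `n_samples` are our own assumptions, not the paper's actual network, hyperparameters, or API.

```python
import jax
import jax.numpy as jnp

# Hypothetical smooth critic Q(s, a) standing in for a learned network.
def q_fn(s, a):
    # Concave in a (action-space curvature), with state-action coupling
    # (a nonzero mixed partial, i.e. the "noise sensitivity" term).
    return -jnp.sum((a - jnp.tanh(s)) ** 2) + 0.1 * jnp.dot(s, a)

grad_a = jax.grad(q_fn, argnums=1)  # induced action-gradient field grad_a Q(s, a)

def q_gradient_volatility(s, a, key, sigma=0.05, n_samples=8):
    """Mean squared change of grad_a Q under small Gaussian state perturbations.

    A PAVE-style critic regularizer would minimize a quantity like this to
    stabilize the action-gradient field without modifying the actor."""
    eps = sigma * jax.random.normal(key, (n_samples,) + s.shape)
    perturbed = jax.vmap(lambda e: grad_a(s + e, a))(eps)
    return jnp.mean(jnp.sum((perturbed - grad_a(s, a)) ** 2, axis=-1))

s = jnp.array([0.3, -0.7])
a = jnp.array([0.1, 0.2])
penalty = q_gradient_volatility(s, a, jax.random.PRNGKey(0))
```

Note that the penalty targets volatility of the gradient field, not the gradient's magnitude, so a sharply curved but stable Q-landscape (the "signal") is left untouched.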