Functional Critics Are Essential for Actor-Critic: From Off-Policy Stability to Efficient Exploration

The actor-critic (AC) framework has achieved strong empirical success in off-policy reinforcement learning but suffers from the "moving target" problem, where the evaluated policy changes continually. Functional critics, or policy-conditioned value functions, address this by explicitly including a representation of the policy as input. While conceptually appealing, previous efforts have struggled to remain competitive against standard AC. In this work, we revisit functional critics within the actor-critic framework and identify two critical aspects that render them a necessity rather than a luxury. First, we demonstrate their power in stabilizing the complex interplay between the "deadly triad" and the "moving target". We provide a convergent off-policy AC algorithm under linear function approximation that dismantles several longstanding barriers between theory and practice: it utilizes target-based TD learning, accommodates dynamic behavior policies, and operates without the restrictive "full coverage" assumptions. By formalizing a dual trust-coverage mechanism, our framework provides principled guidelines for pursuing sample efficiency, rigorously governing behavior policy updates and critic re-evaluations to maximize off-policy data utility. Second, we uncover a foundational link between functional critics and efficient exploration. We demonstrate that existing model-free approximations of posterior sampling are limited in capturing policy-dependent uncertainty, a gap the functional critic formalism bridges. These results represent, to our knowledge, first-of-their-kind contributions to the RL literature. Practically, we propose a tailored neural network architecture and a minimalist AC algorithm. In preliminary experiments on the DeepMind Control Suite, this implementation achieves performance competitive with state-of-the-art methods without standard implementation heuristics.
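To make the notion of a policy-conditioned value function concrete, the following is a minimal illustrative sketch, not the algorithm proposed in this work. It assumes a linear critic over the flattened outer product of state-action features and a fixed-length policy representation, updated by one target-based TD step. The names `LinearFunctionalCritic`, `joint_features`, `policy_feat`, and `target_w` are hypothetical and chosen only for illustration; how the policy is actually represented is left open here.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def joint_features(sa_feat: Sequence[float], policy_feat: Sequence[float]) -> List[float]:
    """Flattened outer product of state-action and policy features (an assumed design)."""
    return [x * z for x in sa_feat for z in policy_feat]


class LinearFunctionalCritic:
    """Linear critic Q(s, a, policy) = w . phi(s, a, policy); illustrative only."""

    def __init__(self, n_sa: int, n_policy: int) -> None:
        self.w: List[float] = [0.0] * (n_sa * n_policy)

    def value(self, sa_feat: Sequence[float], policy_feat: Sequence[float]) -> float:
        return dot(self.w, joint_features(sa_feat, policy_feat))

    def td_update(
        self,
        sa_feat: Sequence[float],
        reward: float,
        next_sa_feat: Sequence[float],
        policy_feat: Sequence[float],
        target_w: Sequence[float],
        lr: float = 0.1,
        gamma: float = 0.99,
    ) -> float:
        """One semi-gradient TD step that bootstraps from frozen target weights."""
        phi = joint_features(sa_feat, policy_feat)
        phi_next = joint_features(next_sa_feat, policy_feat)
        td_target = reward + gamma * dot(target_w, phi_next)
        td_error = td_target - dot(self.w, phi)
        self.w = [w + lr * td_error * p for w, p in zip(self.w, phi)]
        return td_error


critic = LinearFunctionalCritic(n_sa=2, n_policy=2)
td_error = critic.td_update(
    sa_feat=[1.0, 0.0],
    reward=1.0,
    next_sa_feat=[0.0, 1.0],  # assumed to come from the evaluated policy's next action
    policy_feat=[1.0, 0.5],
    target_w=[0.0] * 4,
    lr=0.1,
    gamma=0.9,
)
# The same state-action pair is scored differently under two policy representations.
q_policy_a = critic.value([1.0, 0.0], [1.0, 0.5])
q_policy_b = critic.value([1.0, 0.0], [1.0, -0.5])
```

Because the policy representation is an input, a single set of weights can return different values for the same state-action pair under different policies. In this toy setting, that is what lets the critic be queried for a changed policy rather than re-fit from scratch.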
