Risk-aware Reinforcement Learning (RL) algorithms such as SAC and TD3 have been shown empirically to outperform their risk-neutral counterparts across a variety of continuous-action tasks. However, the theoretical basis for the pessimistic objectives these algorithms employ has not been established, leaving open the question of which class of policies they actually implement. In this work, we apply the expected utility hypothesis, a fundamental concept in economics, to show that both risk-neutral and risk-aware RL objectives can be interpreted as expected utility maximization under an exponential utility function. Under this view, risk-aware policies effectively maximize the certainty equivalent of value, aligning them with conventional decision theory. Furthermore, we propose Dual Actor-Critic (DAC), a risk-aware, model-free algorithm with two distinct actor networks: a pessimistic actor used for temporal-difference learning and an optimistic actor used for exploration. Our evaluation of DAC across a range of locomotion and manipulation tasks demonstrates improvements in both sample efficiency and final performance. Remarkably, DAC matches the performance of leading model-based methods on the challenging dog and humanoid domains while requiring substantially less computation.
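For reference, the following is a minimal LaTeX sketch of the standard definitions the abstract appeals to: the exponential (CARA) utility and the certainty equivalent it induces. The symbols used here ($\beta$ for the risk-sensitivity parameter, $Z^\pi$ for the random return of a policy $\pi$) are notation chosen for illustration, not taken from the paper.

% Exponential (CARA) utility with risk sensitivity \beta > 0; the risk-neutral case is the limit \beta -> 0.
\[
u_\beta(x) = \frac{1 - e^{-\beta x}}{\beta},
\qquad
\lim_{\beta \to 0} u_\beta(x) = x .
\]

% Certainty equivalent of a random return Z^\pi: the sure amount with the same expected utility.
\[
\mathrm{CE}_\beta\!\left[Z^\pi\right]
= u_\beta^{-1}\!\big(\mathbb{E}\!\left[u_\beta(Z^\pi)\right]\big)
= -\frac{1}{\beta}\,\log \mathbb{E}\!\left[e^{-\beta Z^\pi}\right].
\]

The second equality is the familiar entropic risk measure; taking $\beta \to 0$ recovers the risk-neutral expected return.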
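To make the dual-actor idea concrete, below is a minimal PyTorch-style sketch assuming a SAC/TD3-style setup with twin critics. The function and parameter names (actor_losses, pessimistic_actor, optimistic_actor, kappa) and the particular lower/upper-bound objectives are illustrative assumptions about how such a pair of actors could be trained, not DAC's exact formulation.

# Illustrative sketch only: a generic dual-actor objective over twin critics,
# assuming a SAC/TD3-style continuous-control setup. Not the paper's exact losses.
import torch

def actor_losses(pessimistic_actor, optimistic_actor, q1, q2, obs, kappa=1.0):
    """Return (pessimistic_loss, optimistic_loss) for one batch of observations.

    The pessimistic actor (used for temporal-difference targets) maximizes a
    lower bound on value; the optimistic actor (used to collect data) maximizes
    an upper bound, here mean + kappa * spread of the twin critic estimates.
    """
    # Pessimistic actor: act greedily w.r.t. the lower bound min(Q1, Q2).
    a_p = pessimistic_actor(obs)
    q_lower = torch.min(q1(obs, a_p), q2(obs, a_p))
    pessimistic_loss = -q_lower.mean()

    # Optimistic actor: act greedily w.r.t. an upper-bound estimate of value.
    a_o = optimistic_actor(obs)
    q1_o, q2_o = q1(obs, a_o), q2(obs, a_o)
    q_mean = 0.5 * (q1_o + q2_o)
    q_spread = 0.5 * (q1_o - q2_o).abs()
    optimistic_loss = -(q_mean + kappa * q_spread).mean()

    return pessimistic_loss, optimistic_loss

In this sketch, kappa controls how aggressively the exploration actor chases critic disagreement; setting kappa = 0 would make both actors optimize the mean critic estimate and collapse the distinction between them.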