Risk-aware Reinforcement Learning (RL) algorithms like SAC and TD3 have been shown empirically to outperform their risk-neutral counterparts in a variety of continuous-action tasks. However, the theoretical basis for the pessimistic objectives these algorithms employ remains unestablished, raising questions about the specific class of policies they implement. In this work, we apply the expected utility hypothesis, a fundamental concept in economics, to show that both risk-neutral and risk-aware RL objectives can be interpreted as expected utility maximization under an exponential utility function. This view reveals that risk-aware policies effectively maximize the certainty equivalent of value, aligning them with conventional decision theory. Furthermore, we propose Dual Actor-Critic (DAC), a risk-aware, model-free algorithm that features two distinct actor networks: a pessimistic actor for temporal-difference learning and an optimistic actor for exploration. Our evaluations of DAC across various locomotion and manipulation tasks demonstrate improvements in sample efficiency and final performance. Remarkably, DAC, while requiring significantly fewer computational resources, matches the performance of leading model-based methods in the complex dog and humanoid domains.
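To make the certainty-equivalent interpretation concrete, the following is a minimal sketch, not the paper's implementation. It assumes the textbook exponential utility U(x) = -exp(-beta * x), for which the certainty equivalent of a random value X is -(1/beta) * log E[exp(-beta * X)]. The function name `certainty_equivalent`, the risk parameter `beta`, and the Gaussian value samples are all illustrative assumptions. For a Gaussian X the certainty equivalent has the closed form mu - (beta / 2) * sigma^2, so a risk-averse agent (beta > 0) scores an uncertain value below its mean, and beta -> 0 recovers the risk-neutral mean.

```python
import math
import random


def certainty_equivalent(samples, beta):
    """Certainty equivalent of `samples` under exponential utility.

    Assumes U(x) = -exp(-beta * x). beta > 0 is risk-averse; beta == 0 is
    treated as the risk-neutral limit, i.e. the plain sample mean.
    """
    n = len(samples)
    if beta == 0.0:
        return sum(samples) / n
    # log-mean-exp of (-beta * x), shifted by the max for numerical stability
    scaled = [-beta * x for x in samples]
    m = max(scaled)
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scaled) / n)
    return -log_mean_exp / beta


# Hypothetical value estimates drawn from a Gaussian (illustration only).
rng = random.Random(0)
mu, sigma, beta = 1.0, 0.5, 2.0
values = [rng.gauss(mu, sigma) for _ in range(100_000)]

ce_neutral = certainty_equivalent(values, beta=0.0)   # close to mu
ce_averse = certainty_equivalent(values, beta=beta)   # pessimistic estimate
gaussian_closed_form = mu - 0.5 * beta * sigma ** 2   # mu - (beta / 2) * sigma^2
```

In this sketch the risk-averse certainty equivalent falls below the mean by an amount that grows with the variance of the value estimate, which is the sense in which a pessimistic objective can be read as expected utility maximization.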