Offline reinforcement learning (RL) is suitable for safety-critical domains where online exploration is not feasible. In such domains, decision-making should account for the risk of catastrophic outcomes; in other words, it should be risk-averse. An additional challenge of offline RL is avoiding distributional shift, i.e., ensuring that the state-action pairs visited by the policy remain near those in the dataset. Previous works on risk in offline RL combine offline RL techniques (to avoid distributional shift) with risk-sensitive RL algorithms (to achieve risk aversion). In this work, we propose risk aversion as a single mechanism that addresses both issues jointly. We propose a model-based approach and use an ensemble of models to estimate epistemic uncertainty in addition to aleatoric uncertainty. We train a policy that is risk-averse and avoids high-uncertainty actions. Risk aversion to epistemic uncertainty prevents distributional shift, because areas not covered by the dataset have high epistemic uncertainty. Risk aversion to aleatoric uncertainty discourages actions that are inherently risky due to environment stochasticity. Thus, by introducing risk aversion alone, we avoid distributional shift while also achieving risk aversion to aleatoric risk. Our algorithm, 1R2R, achieves strong performance on deterministic benchmarks and outperforms existing approaches for risk-sensitive objectives in stochastic domains.
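The mechanism described above can be illustrated with a small sketch. It is not the 1R2R implementation: it assumes a hypothetical ensemble in which each member predicts a Gaussian over a scalar outcome for a given state-action pair, and it uses conditional value at risk (CVaR) as one common risk measure, which the abstract does not specify. The function names (`decompose_uncertainty`, `cvar`, `risk_averse_value`) and all numbers are made up for illustration. Disagreement between ensemble members stands in for epistemic uncertainty, and the average predicted noise stands in for aleatoric uncertainty.

```python
import numpy as np

def decompose_uncertainty(means, stds):
    """Split the predictive variance of a Gaussian ensemble into two parts.

    means, stds: arrays of shape (n_models,), one Gaussian per ensemble member.
    """
    epistemic = float(np.var(means))       # disagreement between models
    aleatoric = float(np.mean(stds ** 2))  # average within-model noise
    return epistemic, aleatoric

def cvar(samples, alpha):
    """Mean of the worst alpha-fraction of sampled outcomes (lower tail)."""
    samples = np.sort(np.asarray(samples))
    k = max(1, int(np.ceil(alpha * samples.size)))
    return float(samples[:k].mean())

def risk_averse_value(means, stds, alpha=0.1, n_samples=2000, seed=0):
    """CVaR of outcomes drawn from the ensemble mixture (illustrative only)."""
    rng = np.random.default_rng(seed)
    members = rng.integers(0, len(means), size=n_samples)  # pick a model per sample
    outcomes = rng.normal(means[members], stds[members])   # sample its Gaussian
    return cvar(outcomes, alpha)

# Three hypothetical actions with the same expected outcome (1.0).
in_dist = (np.full(5, 1.0), np.full(5, 0.05))          # models agree, low noise
out_of_dist = (np.array([3.0, -1.0, 2.5, -0.5, 1.0]),  # models disagree
               np.full(5, 0.05))
noisy = (np.full(5, 1.0), np.full(5, 1.0))             # models agree, high noise

for name, (m, s) in [("in_dist", in_dist), ("out_of_dist", out_of_dist), ("noisy", noisy)]:
    epi, ale = decompose_uncertainty(m, s)
    print(f"{name}: epistemic={epi:.4f} aleatoric={ale:.4f} "
          f"cvar={risk_averse_value(m, s):.3f}")
```

Under these assumptions, all three actions have the same mean outcome, but the risk-averse value penalizes both the action with high ensemble disagreement and the action with high inherent noise, so a policy maximizing it would prefer the well-covered, low-noise action. This is the sense in which a single risk-averse objective can discourage both out-of-distribution and stochastically risky actions.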