In this paper, we study a novel episodic risk-sensitive Reinforcement Learning (RL) problem, named Iterated CVaR RL, which aims to maximize the tail of the reward-to-go at each step, and focuses on tightly controlling the risk of getting into catastrophic situations at each stage. This formulation is applicable to real-world tasks that demand strong risk avoidance throughout the decision process, such as autonomous driving, clinical treatment planning and robotics. We investigate two performance metrics under Iterated CVaR RL, i.e., Regret Minimization and Best Policy Identification. For both metrics, we design efficient algorithms ICVaR-RM and ICVaR-BPI, respectively, and provide nearly matching upper and lower bounds with respect to the number of episodes $K$. We also investigate an interesting limiting case of Iterated CVaR RL, called Worst Path RL, where the objective becomes to maximize the minimum possible cumulative reward. For Worst Path RL, we propose an efficient algorithm with constant upper and lower bounds. Finally, our techniques for bounding the change of CVaR due to the value function shift and decomposing the regret via a distorted visitation distribution are novel, and can find applications in other risk-sensitive RL problems.
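To make the objective concrete, the following is a minimal sketch of an iterated-CVaR backup on a small tabular MDP. It is not the paper's ICVaR-RM or ICVaR-BPI algorithm: it assumes the model is known, and it assumes the common lower-tail convention in which CVaR at level $\alpha$ is the mean of the worst $\alpha$-fraction of outcomes. The names `cvar_lower_tail`, `iterated_cvar_values`, `rewards`, and `transitions` are hypothetical and chosen only for illustration.

```python
def cvar_lower_tail(values, probs, alpha):
    """Mean of the worst (lowest) alpha-fraction of a discrete distribution.

    Assumed convention: alpha in (0, 1]; alpha = 1 recovers the plain mean.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    remaining = alpha  # probability mass still to be collected from the tail
    total = 0.0
    # Visit outcomes from worst to best, taking mass until alpha is used up.
    for value, prob in sorted(zip(values, probs)):
        take = min(prob, remaining)
        total += take * value
        remaining -= take
        if remaining <= 1e-12:
            break
    return total / alpha


def iterated_cvar_values(rewards, transitions, horizon, alpha):
    """Backward induction where the expectation over next states is replaced
    by a lower-tail CVaR at every step (model assumed known).

    rewards[s][a]     -- immediate reward of action a in state s
    transitions[s][a] -- list of next-state probabilities for (s, a)
    Returns the value of each state with `horizon` steps remaining.
    """
    n_states = len(rewards)
    values = [0.0] * n_states  # value after the final step
    for _ in range(horizon):
        values = [
            max(
                rewards[s][a] + cvar_lower_tail(values, transitions[s][a], alpha)
                for a in range(len(rewards[s]))
            )
            for s in range(n_states)
        ]
    return values
```

Under this convention, a small $\alpha$ makes the backup ignore everything except the worst next states, and as $\alpha$ approaches 0 the CVaR term approaches the minimum value over reachable next states, which matches the Worst Path RL limit described above.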