In this paper, we study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$. Starting with multi-armed bandits (MABs), we show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes, and that it is achieved by an Upper Confidence Bound algorithm with a novel Bernstein bonus. For online RL in tabular Markov Decision Processes (MDPs), we show a minimax regret lower bound of $\Omega(\sqrt{\tau^{-1}SAK})$ (with normalized cumulative rewards), where $S$ is the number of states, and we propose a novel bonus-driven Value Iteration procedure. We show that our algorithm achieves the optimal regret of $\widetilde O(\sqrt{\tau^{-1}SAK})$ under a continuity assumption and in general attains a near-optimal regret of $\widetilde O(\tau^{-1}\sqrt{SAK})$, which is minimax-optimal for constant $\tau$. This improves on the best available bounds. With appropriately discretized rewards, our algorithms are computationally efficient.
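For concreteness, the CVaR objective referred to above can be illustrated on a finite sample of returns. The sketch below assumes the standard variational form $\mathrm{CVaR}_\tau(X)=\sup_b\{b-\tau^{-1}\mathbb{E}[(b-X)^+]\}$ applied to the empirical distribution. The function name `empirical_cvar` and the sample values are illustrative assumptions, not part of the algorithms described in the abstract.

```python
def empirical_cvar(samples, tau):
    """CVaR at risk tolerance tau of the empirical distribution of `samples`.

    Assumes the variational form sup_b { b - E[(b - X)^+] / tau }.
    The objective is concave and piecewise linear in b, with kinks only at
    the sample values, so the supremum is attained at one of the samples.
    """
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must lie in (0, 1]")
    xs = list(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("samples must be non-empty")

    def objective(b):
        # b minus the tau-scaled empirical mean of the shortfall (b - x)^+
        shortfall = sum(max(b - x, 0.0) for x in xs)
        return b - shortfall / (tau * n)

    # Search over the candidate maximizers (the sample values themselves).
    return max(objective(b) for b in xs)


if __name__ == "__main__":
    returns = [0.0, 0.25, 0.5, 1.0]  # hypothetical normalized returns in [0, 1]
    for tau in (1.0, 0.5, 0.25):
        print(tau, empirical_cvar(returns, tau))
```

With $\tau = 1$ the quantity reduces to the sample mean, and smaller $\tau$ places more weight on the worst outcomes, which is the sense in which $\tau$ acts as a risk tolerance.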