We study regret guarantees for risk-sensitive reinforcement learning (RSRL) via distributional reinforcement learning (DRL) methods. In particular, we consider finite episodic Markov decision processes whose objective is the entropic risk measure (EntRM) of the return. By leveraging a key property of the EntRM, the independence property, we establish a risk-sensitive distributional dynamic programming framework. We then propose two novel DRL algorithms that implement optimism through two different schemes, one model-free and one model-based. We prove that both attain a regret upper bound of $\tilde{\mathcal{O}}(\frac{\exp(|\beta| H)-1}{|\beta|}H\sqrt{S^2AK})$, where $S$, $A$, $K$, and $H$ denote the number of states, the number of actions, the number of episodes, and the time horizon, respectively. This bound matches that of RSVI2, proposed in \cite{fei2021exponential}, but is obtained through a novel distributional analysis. To the best of our knowledge, this is the first regret analysis that bridges DRL and RSRL in terms of sample complexity. Because the model-free DRL algorithm is computationally inefficient, we propose an alternative DRL algorithm with distribution representation. This approach maintains the established regret bounds while significantly improving computational efficiency. We also prove a tighter minimax lower bound of $\Omega(\frac{\exp(\beta H/6)-1}{\beta H}H\sqrt{SAT})$ for the $\beta>0$ case, which recovers the tight lower bound $\Omega(H\sqrt{SAT})$ in the risk-neutral setting.
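As a brief worked check of the risk-neutral claim, the following sketch assumes the standard definition of the entropic risk measure, which the abstract itself does not spell out; the notation $U_\beta$ and the return variable $Z$ are introduced here only for illustration. For a bounded random return $Z$ and risk parameter $\beta \neq 0$,

\[
U_\beta(Z) = \frac{1}{\beta}\log \mathbb{E}\left[\exp(\beta Z)\right],
\qquad
\lim_{\beta \to 0} U_\beta(Z) = \mathbb{E}[Z],
\]

so $\beta \to 0$ corresponds to the risk-neutral objective. Expanding the exponential in the prefactor of the lower bound gives

\[
\frac{\exp(\beta H/6)-1}{\beta H}
= \frac{1}{6} + \frac{\beta H}{72} + O\left(\beta^2 H^2\right)
\xrightarrow{\ \beta \to 0\ } \frac{1}{6},
\]

so the lower bound reduces to $\Omega(H\sqrt{SAT})$ up to a constant factor. By the same expansion, the prefactor $\frac{\exp(|\beta| H)-1}{|\beta|}$ in the upper bound tends to $H$ as $\beta \to 0$.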