Risk-sensitive reinforcement learning (RL) aims to optimize policies that balance the expected reward and risk. In this paper, we investigate a novel risk-sensitive RL formulation with an Iterated Conditional Value-at-Risk (CVaR) objective under linear and general function approximations. This new formulation, named ICVaR-RL with function approximation, provides a principled way to guarantee safety at each decision step. For ICVaR-RL with linear function approximation, we propose a computationally efficient algorithm ICVaR-L, which achieves an $\widetilde{O}(\sqrt{\alpha^{-(H+1)}(d^2H^4+dH^6)K})$ regret, where $\alpha$ is the risk level, $d$ is the dimension of state-action features, $H$ is the length of each episode, and $K$ is the number of episodes. We also establish a matching lower bound $\Omega(\sqrt{\alpha^{-(H-1)}d^2K})$ to validate the optimality of ICVaR-L with respect to $d$ and $K$. For ICVaR-RL with general function approximation, we propose algorithm ICVaR-G, which achieves an $\widetilde{O}(\sqrt{\alpha^{-(H+1)}DH^4K})$ regret, where $D$ is a dimensional parameter that depends on the eluder dimension and covering number. Furthermore, our analysis provides several novel techniques for risk-sensitive RL, including an efficient approximation of the CVaR operator, a new ridge regression with CVaR-adapted features, and a refined elliptical potential lemma.
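To make the Iterated CVaR objective concrete, the following is a minimal tabular sketch, not the paper's ICVaR-L or ICVaR-G algorithms. It assumes the standard lower-tail definition $\mathrm{CVaR}^{\alpha}(X)=\sup_{x}\{x-\frac{1}{\alpha}\mathbb{E}[(x-X)^{+}]\}$ for a reward variable $X$, and a known finite MDP in which the value at each step is the immediate reward plus the CVaR of the next-step value. The names `cvar` and `icvar_values` and the array layouts of `P` and `r` are hypothetical choices made for illustration.

```python
import numpy as np

def cvar(values, probs, alpha):
    """Lower-tail CVaR of a discrete distribution: the mean of the worst
    alpha-fraction of outcomes (assumed standard definition, 0 < alpha <= 1)."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(values)            # worst (smallest) outcomes first
    v, p = values[order], probs[order]
    # Probability mass already consumed before each sorted outcome.
    cum_before = np.concatenate(([0.0], np.cumsum(p)[:-1]))
    # Take mass from each outcome until a total of alpha has been taken.
    mass = np.clip(alpha - cum_before, 0.0, p)
    return float(np.dot(mass, v) / alpha)

def icvar_values(P, r, alpha):
    """Backward induction for the iterated CVaR objective on a known MDP.

    P: transitions with shape (H, S, A, S); r: rewards with shape (H, S, A).
    Returns V with shape (H + 1, S), where V[H] is zero (hypothetical layout).
    """
    H, S, A, _ = P.shape
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        for s in range(S):
            # Reward plus CVaR (instead of expectation) of the next-step value.
            V[h, s] = max(
                r[h, s, a] + cvar(V[h + 1], P[h, s, a], alpha)
                for a in range(A)
            )
    return V
```

With `alpha = 1` the CVaR reduces to the expectation, so the recursion coincides with ordinary risk-neutral value iteration; smaller `alpha` puts all the weight on the worst next states at every step, which is the per-step safety notion the formulation refers to.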