We study risk-sensitive reinforcement learning (RL), which is crucial for decision-making in scenarios where uncertainty must be managed and potential adverse outcomes minimized. In particular, our work focuses on applying the entropic risk measure to RL problems. While the existing literature primarily investigates the online setting, there remains a large gap in understanding how to efficiently derive a near-optimal policy under this risk measure using only a pre-collected dataset. We focus on the linear Markov Decision Process (MDP) setting, a well-regarded theoretical framework that has yet to be examined from a risk-sensitive standpoint. In response, we introduce two provably sample-efficient algorithms. We begin by presenting a risk-sensitive pessimistic value iteration algorithm, offering a tight analysis that leverages the structure of the risk-sensitive performance measure. To further improve the obtained bounds, we propose a second pessimistic algorithm that utilizes variance information and reference-advantage decomposition, improving the dependence on both the feature-space dimension $d$ and the risk-sensitivity factor. To the best of our knowledge, ours are the first provably efficient risk-sensitive offline RL algorithms.
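For concreteness, the entropic risk measure referenced above is standardly defined as follows; the notation here (with $\beta$ denoting the risk-sensitivity factor and $R$ the cumulative return) is our choice for illustration and is not fixed by the text above. For a random return $R$ and $\beta \neq 0$,
\[
\rho_\beta(R) \;=\; \frac{1}{\beta}\,\log \mathbb{E}\!\left[e^{\beta R}\right],
\]
so a risk-sensitive RL agent seeks a policy $\pi$ maximizing $\frac{1}{\beta}\log \mathbb{E}^{\pi}\!\big[e^{\beta \sum_{h=1}^{H} r_h}\big]$ over an episode of horizon $H$. A second-order Taylor expansion gives $\rho_\beta(R) \approx \mathbb{E}[R] + \frac{\beta}{2}\mathrm{Var}(R)$ for small $|\beta|$, so $\beta < 0$ induces risk-averse behavior, $\beta > 0$ risk-seeking behavior, and the risk-neutral objective $\mathbb{E}[R]$ is recovered as $\beta \to 0$.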