Risk-sensitive reinforcement learning (RL) aims to optimize policies that balance expected reward and risk. In this paper, we present a novel risk-sensitive RL framework that employs an Iterated Conditional Value-at-Risk (CVaR) objective under both linear and general function approximation, enriched by human feedback. These new formulations provide a principled way to guarantee safety at every decision-making step throughout the control process. Moreover, integrating human feedback into the risk-sensitive RL framework bridges the gap between algorithmic decision-making and human participation, which also allows us to guarantee safety for human-in-the-loop systems. We propose provably sample-efficient algorithms for this Iterated CVaR RL problem and provide a rigorous theoretical analysis. Furthermore, we establish a matching lower bound that corroborates the optimality of our algorithms in the linear setting.
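To make the Iterated CVaR objective concrete, the sketch below computes it exactly on a tiny tabular MDP. It assumes the standard tabular Iterated CVaR Bellman recursion, Q_h(s, a) = r(s, a) + CVaR_alpha over s' ~ P(.|s, a) of V_{h+1}(s'), with V_h(s) = max_a Q_h(s, a) and V_{H+1} = 0, where CVaR_alpha is the mean of the worst alpha-fraction of outcomes. It is only an illustration of the objective under known dynamics, not the paper's function-approximation or human-feedback algorithms; the function names (`cvar_lower`, `iterated_cvar_backward_induction`) and the toy MDP are hypothetical.

```python
import numpy as np


def cvar_lower(values, probs, alpha):
    """Lower-tail CVaR at level alpha in (0, 1] of a discrete reward distribution:
    the mean of the worst alpha-fraction of probability mass."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(values)  # worst (smallest) outcomes first
    remaining, total = alpha, 0.0
    for v, p in zip(values[order], probs[order]):
        take = min(p, remaining)  # portion of this outcome inside the alpha-tail
        total += take * v
        remaining -= take
        if remaining <= 1e-12:
            break
    return total / alpha


def iterated_cvar_backward_induction(P, R, H, alpha):
    """Tabular backward induction for the Iterated CVaR objective (assumed form).

    P: (S, A, S) transition probabilities; R: (S, A) deterministic rewards.
    Returns V_1 with shape (S,) and a greedy policy with shape (H, S).
    """
    S, A, _ = P.shape
    V = np.zeros(S)  # V_{H+1} = 0
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        # Risk-sensitive backup: CVaR of the next-step value replaces the expectation.
        Q = np.array([[R[s, a] + cvar_lower(V, P[s, a], alpha)
                       for a in range(A)] for s in range(S)])
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return V, policy


# Hypothetical toy MDP: state 0 is the start; states 1, 2, 3 are absorbing
# with per-step rewards 1.0, 0.0 and 0.6 respectively.
S, A = 4, 2
P = np.zeros((S, A, S))
P[0, 0, 3] = 1.0                   # action 0 (safe): always reach state 3
P[0, 1, 1], P[0, 1, 2] = 0.9, 0.1  # action 1 (risky): usually state 1, sometimes state 2
for s in (1, 2, 3):
    P[s, :, s] = 1.0
R = np.zeros((S, A))
R[1], R[3] = 1.0, 0.6

V_neutral, pi_neutral = iterated_cvar_backward_induction(P, R, H=2, alpha=1.0)
V_averse, pi_averse = iterated_cvar_backward_induction(P, R, H=2, alpha=0.2)
print("alpha=1.0 action at start:", pi_neutral[0, 0], "value:", V_neutral[0])
print("alpha=0.2 action at start:", pi_averse[0, 0], "value:", V_averse[0])
```

With alpha = 1 the CVaR reduces to the expectation, so the risky action wins (0.9 versus 0.6); with alpha = 0.2 the 10% chance of landing in the zero-reward state dominates the tail and the safe action is preferred. Applying the CVaR at every step, rather than once to the total return, is what ties the guarantee to each individual decision.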