We study best-policy identification for finite-horizon risk-sensitive reinforcement learning under the entropic risk measure. Recent work left a gap between lower and upper bounds on the number of samples required to identify an approximately optimal policy: the two differ by a constant factor in the exponent of the horizon dependence. Precisely, known lower bounds scale as $\Omega(e^{|\beta| H})$, where $\beta$ is the risk parameter and $H$ is the horizon of the MDP, while the state-of-the-art upper bound (arXiv:2506.00286v2), obtained with a generative model, is $O(e^{2|\beta| H})$. We show that this extra exponential factor can be traced to overly loose concentration control for exponential utilities. To close the gap, we revisit the analysis through a forward-model-based algorithm that builds on KL-based exploration bonuses, which we adapt to the entropic criterion. The improvement rests on two technical innovations. First, we leverage the smoothness properties of the exponential utility to derive sharper concentration bounds. Second, we propose a new stopping rule that further exploits this tightness to obtain a sample complexity that matches the lower bound.
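For readers unfamiliar with the criterion, the following is a minimal sketch of the entropic risk measure in its standard form, $\rho_\beta(X) = \frac{1}{\beta}\log \mathbb{E}[e^{\beta X}]$, together with the range of the exponential utility that underlies the $e^{|\beta| H}$ scaling. It is not the algorithm of the paper. The function names, the sign convention for $\beta$, and the assumption that returns lie in $[0, H]$ are illustrative choices that are not stated in the abstract.

```python
import math


def entropic_risk(returns, beta):
    """Empirical entropic risk (1/beta) * log(mean(exp(beta * r))).

    Assumed convention (not from the abstract): beta < 0 is risk-averse
    and beta > 0 is risk-seeking when the values are rewards.
    """
    if beta == 0:
        # The limit beta -> 0 recovers the ordinary mean.
        return sum(returns) / len(returns)
    scaled = [beta * r for r in returns]
    m = max(scaled)
    # Log-mean-exp with the maximum factored out for numerical stability.
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return log_mean_exp / beta


def utility_range(beta, horizon):
    """Width of the interval containing exp(beta * R) for R in [0, horizon].

    Assumes per-step rewards in [0, 1], so that the return lies in [0, horizon].
    """
    return abs(math.exp(beta * horizon) - 1.0)


if __name__ == "__main__":
    sample_returns = [0.0, 2.0]
    print(entropic_risk(sample_returns, -1.0))  # below the mean of 1.0
    print(entropic_risk(sample_returns, 1.0))   # above the mean of 1.0
    print(utility_range(1.0, 10))               # grows exponentially with the horizon
```

Under these assumptions, the exponential utility of a return takes values in an interval whose width grows like $e^{|\beta| H}$ for $\beta > 0$, which gives some intuition for why concentration arguments on this quantity produce exponential horizon dependence.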