This work presents the first regret analysis of risk-sensitive reinforcement learning in partially observable environments with hindsight observation, addressing a gap in the theoretical literature. We introduce a novel formulation that integrates hindsight observations into a Partially Observable Markov Decision Process (POMDP) framework, where the goal is to optimize the accumulated reward under the entropic risk measure. We develop the first provably efficient RL algorithm tailored to this setting and prove that it achieves polynomial regret $\tilde{O}\left(\frac{e^{|\gamma|H}-1}{|\gamma|H}H^2\sqrt{KHS^2OA}\right)$, which outperforms or matches existing upper bounds when the model degenerates to the risk-neutral or fully observable setting. Our analysis uses a change-of-measure argument and introduces beta vectors, a novel analytical tool that streamlines the mathematical derivations. These techniques are of particular interest to the theoretical study of reinforcement learning.
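To make the objective and the risk-dependent factor in the bound concrete, the following is a minimal Python sketch, not the algorithm from this work. It assumes the common convention for the entropic risk measure, $U_\gamma(X) = \frac{1}{\gamma}\log \mathbb{E}\left[e^{\gamma X}\right]$, which may differ from the paper's exact definition; the function names `entropic_risk` and `risk_prefactor` are hypothetical and chosen only for illustration. The second function evaluates the factor $\frac{e^{|\gamma|H}-1}{|\gamma|H}$ from the regret bound, which tends to 1 as $\gamma \to 0$, so that the bound then reduces to $\tilde{O}\left(H^2\sqrt{KHS^2OA}\right)$.

```python
import math


def entropic_risk(returns, gamma):
    """Empirical entropic risk (1/gamma) * log(mean(exp(gamma * R))).

    Assumed convention (not confirmed by the source): gamma > 0 is
    risk-seeking, gamma < 0 is risk-averse, gamma == 0 is the plain mean.
    """
    if gamma == 0:
        return sum(returns) / len(returns)
    scaled = [gamma * r for r in returns]
    # Log-sum-exp trick: subtract the maximum before exponentiating
    # to avoid overflow for large |gamma * r|.
    m = max(scaled)
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return log_mean_exp / gamma


def risk_prefactor(gamma, horizon):
    """Factor (e^{|gamma| H} - 1) / (|gamma| H) appearing in the regret bound."""
    x = abs(gamma) * horizon
    if x == 0:
        return 1.0  # limit of (e^x - 1) / x as x -> 0
    # expm1 keeps precision when x is very small.
    return math.expm1(x) / x


if __name__ == "__main__":
    sample_returns = [0.0, 2.0]  # illustrative data, not from the paper
    for g in (-1.0, 0.0, 1.0):
        print(g, entropic_risk(sample_returns, g), risk_prefactor(g, horizon=5))
```

Under this convention, a negative $\gamma$ values the sample below its mean and a positive $\gamma$ values it above the mean, while the prefactor grows with $|\gamma|H$ and equals 1 in the risk-neutral limit.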