Reinforcement learning (RL) for exponential-utility optimization in discounted Markov decision processes (MDPs) lacks principled value-based algorithms. We address this gap in the fixed risk-aversion setting. Building on the Bellman-type equation for exponential utility studied in \cite{porteus1975optimality}, we derive two Q-value-style extensions and show that the associated operators are contractions in the $L_\infty$ and sup-log/Thompson metrics, respectively. We characterize their fixed points and prove that the induced greedy stationary policy is optimal for the exponential-utility objective among stationary policies. These structural results lead to two model-free algorithms: a two-timescale Q-learning--style algorithm, for which we establish almost-sure convergence and provide finite-time convergence rates via timescale separation, and a one-timescale algorithm governed by a sublinear power-law operator. Since the latter does not admit a global contraction in standard metrics, we prove its convergence using delicate arguments based on local Lipschitzness, monotonicity, homogeneity, and Dini derivatives, and provide a scalar finite-time analysis that highlights the challenges in obtaining convergence rates in the vector case. Our work provides a foundation for value-based RL under exponential-utility objectives.
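As a hedged illustration of the sup-log/Thompson metric mentioned in the abstract, the sketch below checks numerically that a toy map is a contraction in that metric. The map is an assumption made for illustration only and is not the operator from the paper: it applies a positive row-stochastic linear map and then raises each entry to a power $\gamma \in (0,1)$. Positive linear maps are nonexpansive in the Thompson metric and the power step scales log-distances by $\gamma$, so the observed ratios should not exceed $\gamma$. All names here (`thompson_distance`, `toy_power_map`, `contraction_ratios`) are hypothetical.

```python
import math
import random


def thompson_distance(x, y):
    """Sup-log (Thompson) distance between two strictly positive vectors."""
    return max(abs(math.log(a) - math.log(b)) for a, b in zip(x, y))


def toy_power_map(x, weights, gamma):
    """Toy map (an assumption, not the paper's operator):
    y_i = (sum_j weights[i][j] * x_j) ** gamma."""
    return [sum(w * v for w, v in zip(row, x)) ** gamma for row in weights]


def contraction_ratios(gamma=0.9, dim=4, trials=200, seed=0):
    """Return d(T(x), T(y)) / d(x, y) for random positive pairs (x, y)."""
    rng = random.Random(seed)
    # Build a strictly positive, row-stochastic weight matrix.
    weights = []
    for _ in range(dim):
        row = [rng.uniform(0.1, 1.0) for _ in range(dim)]
        total = sum(row)
        weights.append([w / total for w in row])
    ratios = []
    for _ in range(trials):
        x = [rng.uniform(0.1, 10.0) for _ in range(dim)]
        y = [rng.uniform(0.1, 10.0) for _ in range(dim)]
        d_in = thompson_distance(x, y)
        d_out = thompson_distance(
            toy_power_map(x, weights, gamma),
            toy_power_map(y, weights, gamma),
        )
        ratios.append(d_out / d_in)
    return ratios


if __name__ == "__main__":
    ratios = contraction_ratios(gamma=0.9)
    print("largest observed ratio:", max(ratios))
```

The sketch uses only the standard library so it runs without extra dependencies. It says nothing about the one-timescale operator, which the abstract states does not admit a global contraction in standard metrics.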