Average-reward Markov decision processes (MDPs) provide a foundational framework for sequential decision-making under uncertainty. However, average-reward MDPs have remained largely unexplored in reinforcement learning (RL), with most RL research to date focusing on episodic and discounted MDPs. In this work, we study a unique structural property of average-reward MDPs and utilize it to introduce Reward-Extended Differential (or RED) reinforcement learning: a novel RL framework that can be used to effectively and efficiently solve various subtasks simultaneously in the average-reward setting. We introduce a family of RED learning algorithms for prediction and control, including proven-convergent algorithms for the tabular case. We then showcase the power of these algorithms by demonstrating how they can be used to learn a policy that optimizes, for the first time, the well-known conditional value-at-risk (CVaR) risk measure in a fully-online manner, without the use of an explicit bi-level optimization scheme or an augmented state-space.