This dissertation makes three main contributions.

First, we identify a new connection between policy gradient and dynamic programming in multi-model Markov decision processes (MMDPs) and propose the Coordinate Ascent Dynamic Programming (CADP) algorithm, which computes a Markov policy that maximizes the discounted return averaged over the uncertain models. CADP adjusts the model weights iteratively, guaranteeing monotone policy improvement and convergence to a local maximum.

Second, we establish necessary and sufficient conditions for the exponential ERM Bellman operator to be a contraction and prove the existence of stationary deterministic optimal policies for the total reward criterion under the entropic risk measure (ERM-TRC) and under the entropic value at risk (EVaR-TRC). We also propose exponential value iteration, policy iteration, and linear programming algorithms for computing optimal stationary policies for both objectives.

Third, we propose model-free Q-learning algorithms for computing policies under the risk-averse ERM-TRC and EVaR-TRC objectives. The main challenge is that the ERM Bellman operator underlying Q-learning may not be a contraction. We instead exploit the monotonicity of the ERM Bellman operators to give a rigorous proof that the ERM-TRC and EVaR-TRC Q-learning algorithms converge to the optimal risk-averse value functions, from which optimal stationary policies for ERM-TRC and EVaR-TRC are obtained.
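To make the second contribution concrete, the display below sketches one standard form of the exponential ERM Bellman operator. The notation (state space $\mathcal{S}$, action set $\mathcal{A}$, transition kernel $p$, reward $r$, risk level $\beta > 0$) is assumed here for illustration and may differ from the definitions used later in the dissertation. For a random variable $X$, the entropic risk measure is $\mathrm{ERM}_\beta[X] = -\beta^{-1}\log \mathbb{E}\bigl[\exp(-\beta X)\bigr]$, and applying it to the one-step return yields
\[
(T_\beta v)(s) \;=\; \max_{a \in \mathcal{A}} \; -\frac{1}{\beta} \log \sum_{s' \in \mathcal{S}} p(s' \mid s, a)\, \exp\!\bigl(-\beta\,(r(s,a,s') + v(s'))\bigr).
\]
A contraction analysis of an operator of this type is what underlies the second contribution; note the absence of a discount factor, consistent with the total reward criterion.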
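As an illustration of the third contribution, the following is a minimal tabular sketch of an ERM-style Q-learning scheme. It assumes a generic episodic environment interface (env.reset, env.step) and illustrative hyperparameters (beta, alpha, epsilon, episodes); it is a sketch of the general idea rather than the dissertation's algorithm, whose convergence proof relies on the monotonicity of the ERM Bellman operators and an appropriate step-size schedule.

```python
import numpy as np

def erm_q_learning(env, n_states, n_actions, beta=0.5, episodes=5000,
                   alpha=0.1, epsilon=0.1, seed=0):
    """Tabular sketch of an ERM-style Q-learning scheme.

    The iterate W[s, a] tracks an estimate of E[exp(-beta * (r + V(s')))],
    i.e. the exponential (disutility) form of the state-action value, so that
    Q(s, a) = -(1 / beta) * log(W[s, a]).  The environment interface and the
    hyperparameters are illustrative assumptions, not the dissertation's code.
    """
    rng = np.random.default_rng(seed)
    W = np.ones((n_states, n_actions))  # exp(-beta * 0) = 1 as the initial guess

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy on the induced Q-values (smaller W means larger Q).
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmin(W[s]))
            s_next, r, done = env.step(a)

            # Greedy next value in the original (non-exponential) scale.
            v_next = 0.0 if done else -(1.0 / beta) * np.log(np.min(W[s_next]))

            # Stochastic-approximation update toward the exponential target.
            target = np.exp(-beta * (r + v_next))
            W[s, a] += alpha * (target - W[s, a])
            s = s_next
        # A diminishing step size would be needed for a convergence argument.

    Q = -(1.0 / beta) * np.log(W)
    policy = Q.argmax(axis=1)
    return Q, policy
```

Working in the exponential scale is convenient here because the per-sample target exp(-beta * (r + v_next)) is an unbiased estimate of the expectation inside the logarithm of the entropic risk measure, which is the quantity W is meant to track.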