Policy robustness in Reinforcement Learning may not be desirable at any cost: the alterations that robustness requirements impose on otherwise optimal policies should be explainable, quantifiable, and formally verifiable. In this work, we study how policies can be maximally robust to arbitrary observational noise. We analyse how this noise alters a policy by interpreting the disturbances as a stochastic linear operator, and we establish connections between robustness and properties of the noise kernel and of the underlying MDPs. We then construct sufficient conditions for policy robustness and propose a robustness-inducing scheme, applicable to any policy gradient algorithm, that formally trades off expected policy utility for robustness through lexicographic optimisation while preserving convergence and sub-optimality in the policy synthesis.
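To make the stochastic linear operator reading of observational noise concrete, the following is a minimal sketch for a finite state and action space. It assumes the noise kernel can be written as a row-stochastic matrix `T`, where `T[s, s_obs]` is the probability of observing `s_obs` in true state `s`, and that a policy is a matrix `pi` of action probabilities per observed state. The names `perturbed_policy` and `robustness_gap`, and the use of total variation distance as the robustness measure, are illustrative assumptions rather than the definitions used in the work itself.

```python
import numpy as np

def perturbed_policy(T, pi):
    """Action distribution per true state when observations pass through T.

    T[s, s_obs]  : P(observe s_obs | true state s), each row sums to 1 (assumed form).
    pi[s_obs, a] : P(action a | observed state s_obs).
    """
    return T @ pi

def robustness_gap(T, pi):
    """Largest total variation distance, over true states, between pi and T @ pi.

    Illustrative measure (an assumption): a gap of 0 means the noise
    leaves the policy unchanged.
    """
    return 0.5 * np.abs(perturbed_policy(T, pi) - pi).sum(axis=1).max()

# Hypothetical 3-state noise kernel: neighbouring states are sometimes confused.
T = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])

# A deterministic policy that changes action between neighbouring states.
pi_greedy = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [1.0, 0.0]])

# A state-independent policy: unaffected by any row-stochastic T.
pi_constant = np.tile([0.5, 0.5], (3, 1))

gap_greedy = robustness_gap(T, pi_greedy)
gap_constant = robustness_gap(T, pi_constant)
```

In this toy setting the state-independent policy is a fixed point of the operator, so its gap is zero, while the deterministic policy is altered by the noise. This illustrates the tension described above: a policy can be trivially robust yet have low utility, which is why the trade-off between the two needs to be made explicit.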