Reinforcement learning is a powerful framework for training agents to navigate different situations, but it is susceptible to changes in environmental dynamics. Solving Markov decision processes that are robust to such changes is difficult because of nonconvexity and the size of the action or state spaces. Most prior work analyzes this problem under differing assumptions, and a general and efficient theoretical analysis is still missing. We propose a simple framework for improving robustness by iteratively solving a minimax optimization problem in which a policy player and an environmental-dynamics player play against each other. Leveraging recent results in online nonconvex learning and techniques for improving policy gradient methods, we obtain an algorithm that maximizes the robustness of the value function at a rate on the order of $\mathcal{O}\left(\frac{1}{\sqrt{T}}\right)$, where $T$ is the number of iterations of the algorithm.
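To make the two-player structure concrete, the following is a minimal sketch of a policy player and a dynamics player taking turns on a toy tabular problem. It is not the algorithm analyzed in this work: the two-state MDP, the finite set of candidate transition kernels, the best-response rule for the dynamics player, and all names (`make_toy_problem`, `robust_policy_gradient`, `GAMMA`, the step size) are assumptions chosen purely for illustration, and the sketch makes no claim about the $\mathcal{O}\left(\frac{1}{\sqrt{T}}\right)$ rate.

```python
import numpy as np

GAMMA = 0.9  # discount factor (assumed for this toy example)


def softmax_policy(theta):
    """Row-wise softmax: theta has shape (S, A), returns pi(a|s)."""
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def evaluate(pi, P, R, rho):
    """Exact evaluation of pi under kernel P (S, A, S) and rewards R (S, A)."""
    S = R.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)      # state-to-state kernel under pi
    r_pi = (pi * R).sum(axis=1)                # expected one-step reward
    M = np.eye(S) - GAMMA * P_pi
    V = np.linalg.solve(M, r_pi)               # state values
    Q = R + GAMMA * (P @ V)                    # state-action values, shape (S, A)
    d = np.linalg.solve(M.T, rho)              # unnormalized discounted visitation
    return float(rho @ V), V, Q, d


def make_toy_problem():
    """Hypothetical 2-state, 2-action MDP with two candidate dynamics."""
    R = np.array([[1.0, 1.0], [0.0, 0.0]])     # state 0 pays 1, state 1 pays 0

    def kernel(p_a0, p_a1):
        # Action a reaches state 0 with probability p_a, from either state.
        P = np.empty((2, 2, 2))
        P[:, 0] = [p_a0, 1.0 - p_a0]
        P[:, 1] = [p_a1, 1.0 - p_a1]
        return P

    kernels = [kernel(0.9, 0.2), kernel(0.7, 0.5)]
    rho = np.array([0.5, 0.5])                 # initial state distribution
    return kernels, R, rho


def robust_policy_gradient(kernels, R, rho, T=200, lr=1.0):
    """Alternate a worst-case kernel choice with one policy gradient step."""
    theta = np.zeros(R.shape)
    worst_values = []
    for _ in range(T):
        pi = softmax_policy(theta)
        # Dynamics player: pick the candidate kernel that minimizes the value.
        evals = [evaluate(pi, P, R, rho) for P in kernels]
        J, V, Q, d = min(evals, key=lambda e: e[0])
        worst_values.append(J)
        # Policy player: exact softmax policy gradient against that kernel.
        grad = d[:, None] * pi * (Q - V[:, None])
        theta = theta + lr * grad
    return softmax_policy(theta), worst_values
```

Calling `robust_policy_gradient(*make_toy_problem())` returns the final policy together with the worst-case value recorded at each iteration. In this toy problem the more reliable action is the same under both candidate kernels, so the recorded worst-case value does not decrease from one iteration to the next; in general, alternating gradient steps with best responses can oscillate, which is one reason more careful schemes are studied.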