As a framework for sequential decision-making, Reinforcement Learning (RL) has been regarded as an essential component on the path to Artificial General Intelligence (AGI). However, RL is often criticized for assuming that the training environment is identical to the test environment, an assumption that hinders its application in the real world. To mitigate this problem, Distributionally Robust RL (DRRL) has been proposed to improve the worst-case performance over a set of environments that may contain the unknown test environment. Because the robustness objective is nonlinear, most previous work resorts to model-based approaches, learning with either an empirical distribution estimated from data or a simulator that can be sampled infinitely often, which limits their applicability to environments with simple dynamics. In contrast, we attempt to design a DRRL algorithm that can be trained along a single trajectory, i.e., without repeated sampling from any state. Building on standard Q-learning, we propose distributionally robust Q-learning with single trajectory (DRQ) and its average-reward variant, differential DRQ. We provide asymptotic convergence guarantees and experiments for both settings, demonstrating their superiority over non-robust counterparts in perturbed environments.
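For orientation, the robustness goal described above is commonly formalized as follows. This is a sketch under assumptions not stated in the abstract: a discounted setting with discount factor $\gamma$, reward function $r$, and an uncertainty set $\mathcal{P}$ of transition kernels that factorizes over state-action pairs (the usual rectangularity assumption). The symbols are illustrative and are not necessarily the authors' notation.

```latex
% Worst-case (distributionally robust) objective over an uncertainty set \mathcal{P}
\max_{\pi} \; \inf_{P \in \mathcal{P}} \;
  \mathbb{E}_{P,\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \, r(s_t, a_t) \right]

% Robust Bellman optimality equation under (s,a)-rectangularity
Q(s,a) = r(s,a) + \gamma \inf_{P_{s,a} \in \mathcal{P}_{s,a}}
  \mathbb{E}_{s' \sim P_{s,a}}\!\left[ \max_{a'} Q(s',a') \right]
```

Because the infimum is taken over distributions, the backup is in general a nonlinear functional of the nominal transition distribution. A single observed next state therefore does not, in general, yield an unbiased estimate of it, unlike the target in standard Q-learning. This is the difficulty that a single-trajectory method has to work around.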