The sim-to-real gap, i.e., the disparity between training and testing environments, poses a significant challenge in reinforcement learning (RL). A promising approach to this challenge is distributionally robust RL, often framed as a robust Markov decision process (RMDP). In this framework, the objective is to find a robust policy that performs well under the worst-case environment within a pre-specified uncertainty set centered around the training environment. Unlike previous work, which relies on a generative model or a pre-collected offline dataset with good coverage of the deployment environment, we tackle robust RL via interactive data collection, where the learner interacts only with the training environment and refines the policy through trial and error. In this robust RL paradigm, two main challenges emerge: managing distributional robustness, and striking a balance between exploration and exploitation during data collection. We first establish that sample-efficient learning is unattainable without additional assumptions, owing to the curse of support shift, i.e., the potential disjointness of the distributional supports of the training and testing environments. To circumvent this hardness result, we introduce the vanishing minimal value assumption for RMDPs with a total-variation (TV) distance robust set, which postulates that the minimal value of the optimal robust value function is zero. We prove that this assumption effectively eliminates the support shift issue for RMDPs with a TV distance robust set, and we present an algorithm with a provable sample complexity guarantee. Our work takes an initial step toward uncovering the inherent difficulty of robust RL via interactive data collection and toward identifying sufficient conditions for designing a sample-efficient algorithm with a sharp sample complexity analysis.
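To make the support shift issue concrete, the following is a minimal sketch of a one-step worst-case expectation over a TV ball around a nominal transition distribution. It is an illustration only, not the algorithm from the paper. It assumes that the TV distance is half the L1 distance and that the uncertainty set ranges over the full probability simplex; the function name `worst_case_expectation_tv` and the toy numbers are hypothetical. Under these assumptions the adversary greedily moves up to `rho` probability mass from the highest-value states onto a lowest-value state, even when that state lies outside the nominal support.

```python
def worst_case_expectation_tv(nominal, values, rho):
    """Minimize sum_s p[s] * values[s] over distributions p with
    0.5 * ||p - nominal||_1 <= rho (assumed TV convention).

    Returns the worst-case expectation and the minimizing distribution.
    """
    n = len(nominal)
    p = list(nominal)
    # A state with the lowest value receives all of the shifted mass.
    lowest = min(range(n), key=lambda i: values[i])
    budget = rho
    # Take mass from states in order of decreasing value.
    for i in sorted(range(n), key=lambda i: values[i], reverse=True):
        if i == lowest or budget <= 0:
            continue
        moved = min(p[i], budget)
        p[i] -= moved
        p[lowest] += moved
        budget -= moved
    expectation = sum(pi * vi for pi, vi in zip(p, values))
    return expectation, p


# Toy example: state 2 is never reached under the nominal distribution,
# yet the worst-case distribution places mass on it (support shift).
nominal = [0.5, 0.5, 0.0]
values = [1.0, 2.0, 0.0]  # the minimal value is zero
robust_value, worst_p = worst_case_expectation_tv(nominal, values, rho=0.2)
```

In this toy example the lowest-value state has value zero, so the mass shifted onto it contributes nothing to the expectation. The worst case is then determined by the values of states that the nominal distribution does visit, which gives some intuition for why a vanishing minimal value can help a learner that only samples from the training environment.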