Distributionally robust offline reinforcement learning (RL), which seeks robust policy training against environment perturbation by modeling dynamics uncertainty, calls for function approximation when facing large state-action spaces. However, the consideration of dynamics uncertainty introduces essential nonlinearity and computational burden, posing unique challenges for analyzing and practically employing function approximation. Focusing on a basic setting where the nominal model and the perturbed models are linearly parameterized, we propose minimax optimal and computationally efficient algorithms that realize function approximation, and we initiate the study of instance-dependent suboptimality analysis in the context of robust offline RL. Our results uncover that function approximation in robust offline RL is essentially distinct from, and likely harder than, that in standard offline RL. Our algorithms and theoretical results crucially rely on several new techniques: a novel function approximation mechanism incorporating variance information, a new procedure for decomposing suboptimality and estimation uncertainty, a quantification of robust value function shrinkage, and a meticulously designed family of hard instances, which may be of independent interest.
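As a concrete illustration of the linearly parameterized robust setting referenced above, consider the following minimal sketch. The notation (feature map $\phi$, nominal factor $\mu_0$, radius $\rho$, and a $d$-rectangular uncertainty set) is one common formulation and is an assumption here, not fixed by the abstract itself:
$$
P_0(\cdot \mid s,a) = \phi(s,a)^{\top} \mu_0(\cdot), \qquad
\mathcal{U}_\rho(s,a) = \big\{ \phi(s,a)^{\top} \mu \;:\; \mu_i \in \mathcal{B}_\rho(\mu_{0,i}),\ i \in [d] \big\},
$$
$$
V^{\pi}_{\mathrm{rob}}(s) \;=\; \inf_{\,P(\cdot \mid s,a) \in \mathcal{U}_\rho(s,a)\ \forall (s,a)} \; \mathbb{E}_{P,\pi}\!\left[\, \sum_{h=1}^{H} r_h \;\middle|\; s_1 = s \right],
$$
where the learner, given only offline data collected under the nominal model $P_0$, seeks a policy $\hat{\pi}$ with small robust suboptimality $\max_{\pi} V^{\pi}_{\mathrm{rob}}(s_1) - V^{\hat{\pi}}_{\mathrm{rob}}(s_1)$.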