Corruption-Robust Offline Reinforcement Learning with General Function Approximation

We investigate corruption robustness in offline reinforcement learning (RL) with general function approximation, where an adversary can corrupt each sample in the offline dataset, and the corruption level $\zeta\geq0$ quantifies the cumulative corruption over $n$ episodes and $H$ steps. Our goal is to find a policy that is robust to such corruption and minimizes the suboptimality gap with respect to the optimal policy of the uncorrupted Markov decision processes (MDPs). Inspired by the uncertainty-weighting technique developed for robust online RL \citep{he2022nearly,ye2022corruptionrobust}, we design a new uncertainty-weight iteration procedure that computes the weights efficiently on batched samples, and we propose a corruption-robust algorithm for offline RL. Notably, under the assumption of single-policy coverage and knowledge of $\zeta$, our proposed algorithm achieves a suboptimality bound that degrades only by an additive term of $\mathcal O(\zeta \cdot (\text{CC}(\lambda,\hat{\mathcal F},\mathcal Z_n^H))^{1/2} (C(\hat{\mathcal F},\mu))^{-1/2} n^{-1})$ due to the corruption. Here $\text{CC}(\lambda,\hat{\mathcal F},\mathcal Z_n^H)$ is the coverage coefficient, which depends on the regularization parameter $\lambda$, the confidence set $\hat{\mathcal F}$, and the dataset $\mathcal Z_n^H$, and $C(\hat{\mathcal F},\mu)$ is a coefficient that depends on $\hat{\mathcal F}$ and the underlying data distribution $\mu$. When specialized to linear MDPs, the corruption-dependent error term reduces to $\mathcal O(\zeta d n^{-1})$, where $d$ is the dimension of the feature map; this matches the existing lower bound for corrupted linear MDPs and suggests that our analysis is tight in the corruption-dependent term.
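To make the idea of an uncertainty-weight iteration on batched samples concrete, the following is a minimal sketch for the linear special case only. It is an illustration under stated assumptions, not the algorithm of this work: the update rule $w_i = \min(1, \alpha / \|\phi_i\|_{\Lambda^{-1}})$ with $\Lambda = \lambda I + \sum_j w_j \phi_j \phi_j^\top$, the threshold `alpha`, the fixed iteration count `n_iters`, and the function names are all hypothetical choices made for this example.

```python
import numpy as np


def uncertainty_weights(phi, alpha=0.5, lam=1.0, n_iters=10):
    """Fixed-point iteration for per-sample weights (illustrative, linear case).

    Assumed update (not taken from the paper):
        Lambda = lam * I + sum_j w_j * phi_j phi_j^T
        w_i    = min(1, alpha / ||phi_i||_{Lambda^{-1}})
    Samples with large uncertainty ||phi_i||_{Lambda^{-1}} are down-weighted.
    """
    n, d = phi.shape
    w = np.ones(n)  # start from unweighted regression
    for _ in range(n_iters):
        # Weighted, regularized covariance matrix of the whole batch.
        Lambda = lam * np.eye(d) + (phi * w[:, None]).T @ phi
        # bonus_i = sqrt(phi_i^T Lambda^{-1} phi_i), computed for all i at once.
        sol = np.linalg.solve(Lambda, phi.T)  # shape (d, n)
        bonus = np.sqrt(np.einsum("ij,ji->i", phi, sol))
        # Guard against division by zero for all-zero feature vectors.
        w = np.minimum(1.0, alpha / np.maximum(bonus, 1e-12))
    return w


def weighted_ridge(phi, y, w, lam=1.0):
    """Solve min_theta sum_i w_i (y_i - phi_i^T theta)^2 + lam ||theta||^2."""
    d = phi.shape[1]
    weighted_phi = phi * w[:, None]
    A = lam * np.eye(d) + weighted_phi.T @ phi
    b = weighted_phi.T @ y
    return np.linalg.solve(A, b)
```

In this sketch, a sample lying in a poorly covered direction of feature space receives a weight below 1, so a corrupted target on that sample has less influence on the weighted regression. The weights depend on each other through $\Lambda$, which is why they are computed by iterating over the full batch rather than one sample at a time.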
