We investigate robustness to strong data corruption in offline sparse reinforcement learning (RL). In our setting, an adversary may arbitrarily perturb a fraction of the trajectories collected from a high-dimensional but sparse Markov decision process, and our goal is to estimate a near-optimal policy. The main challenge is that, in the high-dimensional regime where the number of samples $N$ is smaller than the feature dimension $d$, exploiting sparsity is essential for obtaining non-vacuous guarantees, yet it has not been systematically studied in offline RL. We analyse the problem under uniform-coverage and sparse single-policy concentrability assumptions. While Least-Squares Value Iteration (LSVI), a standard approach for robust offline RL, performs well under uniform coverage, we show that integrating sparsity into LSVI is unnatural, and its analysis may break down due to overly pessimistic bonuses. To overcome this, we propose actor-critic methods with sparse robust estimator oracles, which avoid pointwise pessimistic bonuses and provide the first non-vacuous guarantees for sparse offline RL under single-policy concentrability coverage. Moreover, we extend our results to the contaminated setting and show that our algorithm remains robust under strong contamination. Our results provide the first non-vacuous guarantees for high-dimensional sparse MDPs with single-policy concentrability coverage and corruption, showing that learning a near-optimal policy remains possible in regimes where traditional robust offline RL techniques may fail.