Neural Network Approximation for Pessimistic Offline Reinforcement Learning

Deep reinforcement learning (RL) has shown remarkable success in specific offline decision-making scenarios, yet its theoretical guarantees are still under development. Existing work on offline RL theory primarily addresses a few simple settings, such as linear MDPs or general function approximation under strong assumptions and independent data, and therefore offers little guidance for practical use. The coupling of deep learning and Bellman residuals makes this problem challenging, in addition to the difficulty of data dependence. In this paper, we establish a non-asymptotic estimation error bound for pessimistic offline RL using general neural network approximation with $\mathcal{C}$-mixing data, expressed in terms of the structure of the networks, the dimension of the dataset, and the concentrability of data coverage, under mild assumptions. Our result shows that the estimation error consists of two parts: the first converges to zero at a desired rate in the sample size with partially controllable concentrability, and the second becomes negligible if the residual constraint is tight. This result demonstrates the explicit efficiency of deep adversarial offline RL frameworks. To achieve this, we use empirical process tools for $\mathcal{C}$-mixing sequences and neural network approximation theory for the H\"{o}lder class. We also develop methods to bound the Bellman estimation error caused by function approximation with empirical Bellman constraint perturbations. Additionally, we present a result that lessens the curse of dimensionality using data with low intrinsic dimensionality and function classes with low complexity. Our estimate provides valuable insights into the development of deep offline RL and guidance for algorithm and model design.
