Offline Reinforcement Learning (RL) suffers from extrapolation error and value overestimation. From a generalization perspective, this issue can be attributed to the over-generalization of value functions or policies towards out-of-distribution (OOD) actions. Significant efforts have been devoted to mitigating such generalization, and recent in-sample learning approaches have further succeeded in eschewing it entirely. Nevertheless, we show that mild generalization beyond the dataset can be trusted and leveraged to improve performance under certain conditions. To appropriately exploit generalization in offline RL, we propose Doubly Mild Generalization (DMG), comprising (i) mild action generalization and (ii) mild generalization propagation. The former refers to selecting actions in a close neighborhood of the dataset to maximize the Q values. Even so, potentially erroneous generalization can still be propagated, accumulated, and exacerbated through bootstrapping. In light of this, the latter concept is introduced to mitigate the propagation of generalization errors without impeding the propagation of RL learning signals. Theoretically, DMG guarantees better performance than the in-sample optimal policy in the oracle generalization scenario. Even under worst-case generalization, DMG can still control value overestimation at a certain level and lower bound the performance. Empirically, DMG achieves state-of-the-art performance across Gym-MuJoCo locomotion tasks and challenging AntMaze tasks. Moreover, benefiting from its flexibility in both generalization aspects, DMG enjoys a seamless transition from offline to online learning and attains strong online fine-tuning performance.
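The two ingredients above can be illustrated with a minimal sketch. This is not the paper's actual algorithm: the discretized action grid, the neighborhood radius `epsilon`, and the blending weight `lam` are all assumed for illustration. Component (i) maximizes Q only within an epsilon-neighborhood of a dataset action; component (ii) blends that neighborhood maximum with the in-sample value when forming a bootstrapping target, so erroneous generalization is damped rather than compounded.

```python
import numpy as np

# Toy setup: a discretized 1-D action grid and a hypothetical learned Q(s, a).
actions = np.arange(-100, 101) / 100.0      # action grid over [-1, 1]
q_values = -(actions - 0.3) ** 2            # hypothetical Q; peaks at a = 0.3

dataset_action = 0.0                         # the action observed in the dataset
epsilon = 0.1                                # assumed neighborhood radius

# (i) Mild action generalization: maximize Q only within an
# epsilon-neighborhood of the dataset action, not over all actions.
mask = np.abs(actions - dataset_action) <= epsilon
a_mild = actions[mask][np.argmax(q_values[mask])]

# (ii) Mild generalization propagation: when bootstrapping, blend the
# (possibly over-generalized) neighborhood max with the in-sample value,
# so generalization errors are attenuated instead of accumulated.
lam = 0.5                                    # assumed blending weight
q_in_sample = q_values[np.argmin(np.abs(actions - dataset_action))]
q_target_component = lam * q_values[mask].max() + (1 - lam) * q_in_sample
```

Here the neighborhood-constrained maximizer stays close to the data (a = 0.1 rather than the global argmax a = 0.3), and the blended target exceeds the pure in-sample value while remaining below the unconstrained maximum.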