This paper aims to improve the robustness of offline reinforcement learning (RL) under heavy-tailed rewards, which are common in real-world applications. We propose two algorithmic frameworks, ROAM and ROOM, for robust off-policy evaluation (OPE) and offline policy optimization (OPO), respectively. Central to both frameworks is the combination of the median-of-means method with offline RL, which enables straightforward uncertainty estimation for the value function estimator. This not only adheres to the principle of pessimism in OPO but also handles heavy-tailed rewards. Theoretical results and extensive experiments demonstrate that our two frameworks outperform existing methods when the logged dataset exhibits heavy-tailed reward distributions.
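To make the median-of-means idea concrete, the following is a minimal, generic sketch and not the ROAM or ROOM procedure itself, whose details are not given in this abstract. It splits samples into blocks, averages each block, and takes the median of the block means. The function names `median_of_means` and `pessimistic_value`, the `n_blocks` parameter, and the use of a lower quantile of the block means as a pessimistic estimate are all illustrative assumptions.

```python
import numpy as np


def median_of_means(samples, n_blocks, rng=None):
    """Generic median-of-means estimate of a mean (illustrative sketch).

    Returns the estimate and the per-block means; the spread of the
    block means gives a simple handle on estimator uncertainty.
    """
    samples = np.asarray(samples, dtype=float)
    if not 1 <= n_blocks <= samples.size:
        raise ValueError("n_blocks must be between 1 and the sample count")
    rng = np.random.default_rng() if rng is None else rng
    # Shuffle so that block membership does not depend on data order.
    shuffled = rng.permutation(samples)
    blocks = np.array_split(shuffled, n_blocks)
    block_means = np.array([block.mean() for block in blocks])
    return float(np.median(block_means)), block_means


def pessimistic_value(block_means, q=0.25):
    """Hypothetical pessimistic estimate: a lower quantile of block means.

    This is an assumption made for illustration, not the paper's rule.
    """
    return float(np.quantile(block_means, q))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Heavy-tailed "rewards": Student-t with 1.5 degrees of freedom.
    rewards = rng.standard_t(df=1.5, size=10_000)
    estimate, block_means = median_of_means(rewards, n_blocks=20, rng=rng)
    print("sample mean:     ", rewards.mean())
    print("median of means: ", estimate)
    print("pessimistic value:", pessimistic_value(block_means))
```

A single extreme reward can corrupt at most one block mean, so the median across blocks is unaffected as long as most blocks are clean. The plain sample mean has no such protection.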