Offline reinforcement learning (RL) presents a promising approach for learning policies from offline datasets without the need for costly or unsafe interactions with the environment. However, datasets collected by humans in real-world environments are often noisy and may even be maliciously corrupted, which can significantly degrade the performance of offline RL. In this work, we first investigate the performance of current offline RL algorithms under comprehensive data corruption affecting states, actions, rewards, and dynamics. Our extensive experiments reveal that implicit Q-learning (IQL) is remarkably resilient to data corruption compared with other offline RL algorithms. Furthermore, we conduct both empirical and theoretical analyses to understand IQL's robust performance, identifying its supervised policy learning scheme as the key factor. Despite its relative robustness, IQL still suffers from heavy-tailed targets for its Q functions under dynamics corruption. To tackle this challenge, we draw on robust statistics, employing the Huber loss to handle the heavy-tailedness and quantile estimators to balance penalization of corrupted data against learning stability. By incorporating these simple yet effective modifications into IQL, we propose a more robust offline RL approach named Robust IQL (RIQL). Extensive experiments demonstrate that RIQL exhibits highly robust performance when subjected to diverse data corruption scenarios.
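The two robust-statistics ingredients named in the abstract can be illustrated with a minimal NumPy sketch. This is not the RIQL implementation: the function names `huber_loss` and `quantile_q`, the threshold `delta`, the quantile level `alpha`, and the idea of taking the quantile across an ensemble of Q estimates are all assumptions made here for illustration, since the abstract does not give these details.

```python
import numpy as np


def huber_loss(residual, delta=1.0):
    """Huber loss: quadratic for |residual| <= delta, linear beyond it.

    The linear tail limits the influence of very large (heavy-tailed)
    residuals compared with a squared-error loss. `delta` is an assumed
    hyperparameter, not a value taken from the abstract.
    """
    abs_r = np.abs(residual)
    quadratic = 0.5 * residual ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.where(abs_r <= delta, quadratic, linear)


def quantile_q(q_ensemble, alpha=0.25):
    """Return the alpha-quantile over the ensemble axis (axis 0).

    `q_ensemble` has shape (ensemble_size, batch_size). A lower `alpha`
    gives a more pessimistic estimate; alpha = 0.5 gives the median.
    Using an ensemble here is an assumption of this sketch.
    """
    return np.quantile(q_ensemble, alpha, axis=0)


# One small residual, one at the threshold, one outlier.
residuals = np.array([0.5, -1.0, 10.0])
huber_values = huber_loss(residuals, delta=1.0)
squared_values = 0.5 * residuals ** 2

# Five hypothetical Q estimates for each of two state-action pairs.
q_ensemble = np.array([
    [1.0, 4.0],
    [2.0, 5.0],
    [3.0, 6.0],
    [4.0, 7.0],
    [5.0, 8.0],
])
q_low = quantile_q(q_ensemble, alpha=0.25)
q_median = quantile_q(q_ensemble, alpha=0.5)

print("Huber:   ", huber_values)
print("Squared: ", squared_values)
print("Q (0.25):", q_low)
print("Q (0.50):", q_median)
```

For the outlier residual of 10, the squared-error term is 50 while the Huber term is 9.5, which shows how the linear tail damps heavy-tailed targets. Moving `alpha` between the minimum and the median is one way to trade pessimism toward possibly corrupted data against stability of learning.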