Learning policies from offline datasets through offline reinforcement learning (RL) holds promise for scaling data-driven decision-making while avoiding unsafe and costly online interactions. However, real-world data collected from sensors or humans often contains noise and errors, posing a significant challenge for existing offline RL methods. Our study shows that traditional offline RL methods based on temporal-difference learning tend to underperform Decision Transformer (DT) under data corruption, especially when the amount of data is limited. This suggests the potential of sequence modeling for tackling data corruption in offline RL. To further unleash this potential, we propose Robust Decision Transformer (RDT), which incorporates several robustness techniques. Specifically, we introduce Gaussian weighted learning and iterative data correction to reduce the effect of corrupted data. Additionally, we leverage embedding dropout to enhance the model's resistance to erroneous inputs. Extensive experiments on MuJoCo, Kitchen, and Adroit tasks demonstrate RDT's superior performance under diverse data corruption compared to prior methods. Moreover, RDT exhibits remarkable robustness in a challenging setting that combines training-time data corruption with test-time observation perturbations. These results highlight the potential of robust sequence modeling for learning from noisy or corrupted offline datasets, thereby promoting the reliable application of offline RL in real-world tasks.
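To make the idea of Gaussian weighted learning concrete, the sketch below shows one plausible instantiation: per-sample losses are down-weighted by a Gaussian kernel over the prediction error, so likely-corrupted samples (with large error) contribute less to the training objective. This is an illustrative sketch only, not the paper's exact formulation; the function names and the bandwidth parameter `sigma` are assumptions introduced here.

```python
import numpy as np

def gaussian_weights(errors, sigma=1.0):
    """Gaussian kernel over per-sample prediction errors.

    Samples with small error get weight near 1; samples with large error
    (plausibly corrupted) are smoothly down-weighted toward 0.
    Note: this helper and its `sigma` bandwidth are illustrative assumptions.
    """
    errors = np.asarray(errors, dtype=float)
    return np.exp(-(errors ** 2) / (2.0 * sigma ** 2))

def gaussian_weighted_loss(errors, sigma=1.0):
    """Weighted mean-squared loss using the Gaussian weights above."""
    errors = np.asarray(errors, dtype=float)
    w = gaussian_weights(errors, sigma)
    return float(np.sum(w * errors ** 2) / np.sum(w))
```

For example, with `sigma=1.0` a sample whose error is 3 receives weight `exp(-4.5) ≈ 0.01`, so a single badly corrupted transition barely moves the objective, whereas a plain mean-squared loss would let it dominate.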