Offline Reinforcement Learning (RL) has emerged as a powerful alternative to imitation learning for behavior modeling in various domains, particularly in complex navigation tasks. A persistent challenge in offline RL is the signal-to-noise ratio, i.e., how to mitigate incorrect policy updates caused by errors in value estimates. To address this, multiple works have demonstrated the advantage of hierarchical offline RL methods, which decouple high-level path planning from low-level path following. In this work, we present a novel hierarchical transformer-based approach leveraging a learned quantizer of the space. This quantization enables the training of a simpler zone-conditioned low-level policy and simplifies planning, which reduces to discrete autoregressive prediction. Among other benefits, zone-level reasoning in planning enables explicit trajectory stitching rather than implicit stitching based on noisy value function estimates. By combining this transformer-based planner with recent advancements in offline RL, our proposed approach achieves state-of-the-art results in complex long-distance navigation environments.
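The abstract describes the approach only at a high level. As a hedged illustration of the two-level idea (not the paper's actual method), the sketch below uses plain k-means as a stand-in for the learned space quantizer and a breadth-first search over a zone-transition graph as a stand-in for the transformer planner; because zone transitions harvested from different trajectories can be chained, the returned zone path makes trajectory stitching explicit. All function names (`kmeans_quantize`, `to_zone`, `build_zone_graph`, `plan_zones`) are hypothetical.

```python
import numpy as np
from collections import deque

def kmeans_quantize(states, n_zones=8, n_iters=50, seed=0):
    """Learn zone centroids with plain k-means.
    A simple stand-in for the paper's learned quantizer of the space."""
    rng = np.random.default_rng(seed)
    centroids = states[rng.choice(len(states), n_zones, replace=False)].copy()
    for _ in range(n_iters):
        # assign each state to its nearest centroid, then recompute centroids
        d = np.linalg.norm(states[:, None] - centroids[None], axis=-1)
        labels = d.argmin(axis=1)
        for k in range(n_zones):
            if np.any(labels == k):
                centroids[k] = states[labels == k].mean(axis=0)
    # final zone assignment consistent with the returned centroids
    d = np.linalg.norm(states[:, None] - centroids[None], axis=-1)
    return centroids, d.argmin(axis=1)

def to_zone(state, centroids):
    """Quantize a single continuous state to a discrete zone token."""
    return int(np.linalg.norm(centroids - state, axis=1).argmin())

def build_zone_graph(zone_trajectories):
    """Adjacency over zones collected from offline trajectories.
    Edges observed in *different* trajectories land in one graph,
    which is what allows explicit stitching at plan time."""
    graph = {}
    for traj in zone_trajectories:
        for a, b in zip(traj, traj[1:]):
            if a != b:
                graph.setdefault(a, set()).add(b)
    return graph

def plan_zones(start, goal, graph):
    """BFS over the zone graph: a toy stand-in for autoregressive zone
    prediction; the resulting zone path may stitch segments from
    different trajectories. Returns None if the goal is unreachable."""
    frontier, parent = deque([start]), {start: None}
    while frontier:
        z = frontier.popleft()
        if z == goal:
            path = []
            while z is not None:
                path.append(z)
                z = parent[z]
            return path[::-1]
        for nxt in graph.get(z, ()):
            if nxt not in parent:
                parent[nxt] = z
                frontier.append(nxt)
    return None
```

In this toy setup the low-level policy would only need to reach the next zone in the plan, which is the simplification the abstract attributes to zone conditioning.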