Offline Reinforcement Learning (RL) enables policy learning without active interaction, making it especially appealing for self-driving tasks. Recent successes of Transformers have inspired casting offline RL as sequence modeling, which, however, fails in stochastic environments because it incorrectly assumes that identical actions consistently achieve the same outcome. In this paper, we introduce an UNcertainty-awaRE deciSion Transformer (UNREST) for planning in stochastic driving environments without introducing additional transition models or complex generative models. Specifically, UNREST estimates uncertainty via the conditional mutual information between transitions and returns. Exploiting the 'uncertainty accumulation' and 'temporal locality' properties of driving environments, we replace the global returns in decision transformers with truncated returns that are less affected by the environment, so the model learns from the actual outcomes of actions rather than from environment transitions. We also dynamically evaluate uncertainty at inference time for cautious planning. Extensive experiments demonstrate UNREST's superior performance across diverse driving scenarios and the effectiveness of our uncertainty estimation strategy.
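To make the return-conditioning difference concrete, the sketch below contrasts the global return-to-go used by a standard decision transformer with a fixed-horizon truncated return. This is a minimal illustration only: the function names, the undiscounted sums, and the use of a fixed `horizon` are assumptions for exposition, not UNREST's actual uncertainty-guided segmentation.

```python
from typing import List

def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Global return-to-go at each timestep, the conditioning signal
    of a standard decision transformer (undiscounted when gamma=1)."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def truncated_returns_to_go(rewards: List[float], horizon: int) -> List[float]:
    """Return-to-go summed over only the next `horizon` steps, so the
    conditioning target depends less on distant, stochastic transitions.
    A fixed horizon is a simplifying assumption for illustration."""
    return [sum(rewards[t:t + horizon]) for t in range(len(rewards))]
```

On a four-step trajectory with unit rewards, the global returns-to-go are `[4, 3, 2, 1]`, while a horizon of 2 yields `[2, 2, 2, 1]`: later, less predictable rewards no longer dominate the early conditioning targets.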