Offline reinforcement learning (RL) has emerged as a promising framework for learning policies without active interaction with the environment, making it especially appealing for autonomous driving tasks. The recent success of Transformers has inspired casting offline RL as a sequence modeling problem, an approach that performs well on long-horizon tasks. However, such methods are overly optimistic in stochastic environments, because they incorrectly assume that identical actions consistently achieve the same goal. In this paper, we introduce an UNcertainty-awaRE deciSion Transformer (UNREST) for planning in stochastic driving environments without introducing an additional transition model or complex generative models. Specifically, UNREST estimates state uncertainties via the conditional mutual information between transitions and returns, and segments sequences accordingly. Building on the "uncertainty accumulation" and "temporal locality" properties that we observe in driving environments, UNREST replaces the global returns in decision transformers with less uncertain truncated returns, so that it learns from the true outcomes of agent actions rather than from environment transitions. We also dynamically evaluate environmental uncertainty during inference for cautious planning. Extensive experimental results demonstrate UNREST's superior performance in various driving scenarios and the effectiveness of our uncertainty estimation strategy.
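To make the idea of truncated returns concrete, here is a minimal sketch of one possible reading. It assumes per-step uncertainty scores are already available (the abstract derives them from conditional mutual information, which is not computed here), and it assumes a simple fixed-threshold rule for segmentation. The names `segment_by_uncertainty`, `truncated_returns`, and `threshold` are hypothetical and are not taken from the paper.

```python
from typing import List


def segment_by_uncertainty(uncertainty: List[float], threshold: float) -> List[int]:
    """Assign a segment id to every timestep.

    Assumption: a step is "uncertain" when its score exceeds `threshold`,
    and a new segment starts whenever the certain/uncertain label flips.
    """
    ids: List[int] = []
    seg = 0
    for t, u in enumerate(uncertainty):
        if t > 0 and (u > threshold) != (uncertainty[t - 1] > threshold):
            seg += 1
        ids.append(seg)
    return ids


def truncated_returns(rewards: List[float], segment_ids: List[int]) -> List[float]:
    """Return-to-go that stops accumulating at the end of each segment."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Reset at the last step of the sequence and at every segment boundary.
        if t == len(rewards) - 1 or segment_ids[t + 1] != segment_ids[t]:
            running = 0.0
        running += rewards[t]
        out[t] = running
    return out


def global_returns(rewards: List[float]) -> List[float]:
    """Standard return-to-go over the whole sequence, for comparison."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out


rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
uncertainty = [0.1, 0.1, 0.9, 0.9, 0.1]  # made-up illustrative scores
segment_ids = segment_by_uncertainty(uncertainty, threshold=0.5)
print(segment_ids)                               # [0, 0, 1, 1, 2]
print(truncated_returns(rewards, segment_ids))   # [2.0, 1.0, 2.0, 1.0, 1.0]
print(global_returns(rewards))                   # [5.0, 4.0, 3.0, 2.0, 1.0]
```

In this toy example the truncated return at the first step depends only on rewards inside its own low-uncertainty segment, whereas the global return also absorbs rewards that follow the uncertain steps. How UNREST actually conditions on returns within uncertain segments is not specified in the abstract, so that part is left out of the sketch.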