Offline Reinforcement Learning (RL) has emerged as a promising framework for learning policies without active interaction, which makes it especially appealing for autonomous driving tasks. The recent success of Transformers has inspired casting offline RL as a sequence modeling problem, an approach that performs well in long-horizon tasks. However, such methods are overly optimistic in stochastic environments because they incorrectly assume that identical actions consistently achieve the same goal. In this paper, we introduce an UNcertainty-awaRE deciSion Transformer (UNREST) for planning in stochastic driving environments without introducing additional transition models or complex generative models. Specifically, UNREST estimates state uncertainties by the conditional mutual information between transitions and returns, and segments sequences accordingly. Building on the "uncertainty accumulation" and "temporal locality" properties that we identify in driving environments, UNREST replaces the global returns in decision transformers with less uncertain truncated returns, so that it learns from the true outcomes of agent actions rather than from environment transitions. We also dynamically evaluate environmental uncertainty during inference for cautious planning. Extensive experimental results demonstrate UNREST's superior performance in various driving scenarios and the effectiveness of our uncertainty estimation strategy.
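
To make the idea of truncated returns concrete, the following is a minimal illustrative sketch and not UNREST's actual implementation. It assumes that per-step uncertainty scores are already available (in UNREST they would come from the conditional mutual information estimate) and that a step counts as uncertain when its score exceeds a fixed threshold. The function name `truncated_returns`, the `threshold` parameter, and the rule of cutting the sum just before the next uncertain step are hypothetical choices made for illustration only.

```python
from typing import List, Sequence


def truncated_returns(
    rewards: Sequence[float],
    uncertainty: Sequence[float],
    threshold: float,
) -> List[float]:
    """For each step t, sum rewards from t up to, but not including,
    the next step after t whose uncertainty exceeds `threshold`.

    Assumption for illustration: a simple threshold rule defines the
    segment boundaries. If no later step is uncertain, the sum runs to
    the end of the trajectory (the usual return-to-go).
    """
    if len(rewards) != len(uncertainty):
        raise ValueError("rewards and uncertainty must have the same length")

    n = len(rewards)
    out = [0.0] * n
    running = 0.0
    # Walk backwards so each truncated return is built in O(1) per step.
    for t in range(n - 1, -1, -1):
        # If the next step is uncertain, rewards from there on are excluded.
        if t + 1 < n and uncertainty[t + 1] > threshold:
            running = 0.0
        running += rewards[t]
        out[t] = running
    return out


rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
uncertainty = [0.0, 0.0, 0.9, 0.0, 0.0]  # hypothetical scores; step 2 is uncertain
print(truncated_returns(rewards, uncertainty, threshold=0.5))
# -> [2.0, 1.0, 3.0, 2.0, 1.0]
```

For comparison, the global returns-to-go for the same trajectory would be `[5.0, 4.0, 3.0, 2.0, 1.0]`. Under the assumed rule, the conditioning targets for steps 0 and 1 no longer include rewards that depend on the uncertain transition at step 2.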