While multimodal large language models (MLLMs) provide advanced reasoning for autonomous driving, translating their discrete semantic knowledge into continuous trajectories remains a fundamental challenge. Existing methods often rely on unimodal planning heads that inherently limit their ability to represent multimodal driving behavior. Furthermore, many generative approaches condition on one-hot encoded actions, discarding the nuanced navigational uncertainty critical for complex scenarios. To resolve these limitations, we introduce LAD-Drive, a generative framework that structurally disentangles high-level intention from low-level spatial planning. LAD-Drive employs an action decoder to infer a probabilistic meta-action distribution, establishing an explicit belief state that preserves the nuanced intent typically lost in one-hot encodings. This distribution, fused with the vehicle's kinematic state, conditions an action-aware diffusion decoder that uses a truncated denoising process to refine learned motion anchors into safe, kinematically feasible trajectories. Extensive evaluations on the LangAuto benchmark demonstrate that LAD-Drive achieves state-of-the-art results, outperforming competitive baselines by up to 59% in Driving Score while significantly reducing route deviations and collisions. We will publicly release the code and models at https://github.com/iis-esslingen/lad-drive.
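The pipeline described above can be illustrated with a minimal sketch: a softmax over meta-action logits yields a probabilistic belief (rather than a one-hot choice), this belief is concatenated with the ego kinematic state as a conditioning vector, and a truncated denoising loop starts from a learned motion anchor instead of pure noise. All names, dimensions, and the stand-in denoiser below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical meta-action belief: probabilities over discrete intents
# (the action set and logits are assumed for illustration only).
logits = np.array([2.0, 0.5, -1.0])        # e.g. [go straight, turn left, turn right]
belief = softmax(logits)                   # explicit belief state, not a one-hot vector

ego_state = np.array([5.0, 0.0, 0.1])      # assumed: speed, yaw rate, acceleration
cond = np.concatenate([belief, ego_state]) # conditioning vector for the decoder

# Truncated denoising: start from a learned motion anchor plus mild noise,
# then run only the last K of T diffusion steps instead of the full chain.
rng = np.random.default_rng(0)
anchor = np.linspace([0.0, 0.0], [10.0, 0.5], 8)   # 8-waypoint anchor trajectory (x, y)
T, K = 50, 10
x = anchor + 0.5 * rng.standard_normal(anchor.shape)

def denoise_step(x, t, cond):
    # Stand-in for a learned denoising network: pull the noisy sample
    # toward the anchor, weighted by the confidence of the dominant intent.
    strength = (1.0 - t / T) * cond[:3].max()
    return x + strength * (anchor - x)

for t in range(T - K, T):
    x = denoise_step(x, t, cond)

print(x.shape)  # (8, 2): refined 2D waypoints
```

The key point the sketch captures is that the full belief vector, not just its argmax, enters the conditioning signal, so residual uncertainty between intents can still shape the refined trajectory.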