The emergence of multi-modal foundation models has markedly transformed autonomous driving technology, shifting it away from conventional, mostly hand-crafted design choices towards unified, foundation-model-based approaches capable of directly inferring motion trajectories from raw sensory inputs. This new class of methods can also incorporate natural language as an additional modality, with Vision-Language-Action (VLA) models serving as a representative example. In this review, we examine such methods through a unifying taxonomy and critically evaluate their architectural design choices, methodological strengths, and inherent capabilities and limitations. Our survey covers 37 recently proposed approaches that span the landscape of trajectory planning with foundation models. Furthermore, we assess these approaches with respect to the openness of their source code and datasets, offering practical information to practitioners and researchers. We provide an accompanying webpage that catalogues the methods based on our taxonomy, available at: https://github.com/fiveai/FMs-for-driving-trajectories