Embodied scene understanding is the cornerstone that lets autonomous agents perceive, interpret, and respond to open driving scenarios. Such understanding is typically built on Vision-Language Models (VLMs). However, existing VLMs are restricted to the 2D domain and lack spatial awareness and long-horizon extrapolation capabilities. We revisit the key aspects of autonomous driving and formulate appropriate rubrics. To this end, we introduce the Embodied Language Model (ELM), a comprehensive framework tailored for agents' understanding of driving scenes with large spatial and temporal spans. ELM incorporates space-aware pre-training to endow the agent with robust spatial localization capabilities. In addition, the model employs time-aware token selection to accurately retrieve temporal cues. We instantiate ELM on the reformulated multi-faceted benchmark, and it surpasses previous state-of-the-art approaches in all aspects. All code, data, and models will be publicly shared.