Talk2BEV is a large vision-language model (LVLM) interface for bird's-eye view (BEV) maps in autonomous driving contexts. While existing perception systems for autonomous driving have largely focused on a pre-defined (closed) set of object categories and driving scenarios, Talk2BEV blends recent advances in general-purpose language and vision models with BEV-structured map representations, eliminating the need for task-specific models. This enables a single system to cater to a variety of autonomous driving tasks, encompassing visual and spatial reasoning, predicting the intents of traffic actors, and decision-making based on visual cues. We extensively evaluate Talk2BEV on a large number of scene understanding tasks that rely on both the ability to interpret free-form natural language queries and the ability to ground these queries in the visual context embedded in the language-enhanced BEV map. To enable further research in LVLMs for autonomous driving scenarios, we develop and release Talk2BEV-Bench, a benchmark encompassing 1,000 human-annotated BEV scenarios, with more than 20,000 questions and ground-truth responses from the NuScenes dataset.