A new trend in the computer vision community is to capture objects of interest by following flexible human commands expressed as natural language prompts. However, progress on language prompts in driving scenarios has been held back by the scarcity of paired prompt-instance data. To address this challenge, we propose NuPrompt, the first object-centric language prompt set for driving scenes in 3D, multi-view, and multi-frame space. It extends the nuScenes dataset with a total of 35,367 language descriptions, each referring to an average of 5.3 object tracks. Based on the object-text pairs from the new benchmark, we formulate a new prompt-based driving task, i.e., using a language prompt to predict the trajectory of the described objects across views and frames. Furthermore, we provide a simple end-to-end Transformer-based baseline model named PromptTrack. Experiments show that PromptTrack achieves impressive performance on NuPrompt. We hope this work offers new insights for the autonomous driving community. The dataset and code will be made public at https://github.com/wudongming97/Prompt4Driving.