Perception and prediction are two separate modules in existing autonomous driving systems. They interact through hand-picked features such as agent bounding boxes and trajectories. Because of this separation, prediction, as the downstream module, receives only limited information from the perception module. Worse, errors from the perception module can propagate and accumulate, degrading the prediction results. In this work, we propose ViP3D, a query-based visual trajectory prediction pipeline that exploits rich information from raw videos to directly predict the future trajectories of agents in a scene. ViP3D uses sparse agent queries to detect, track, and predict throughout the pipeline, making it the first fully differentiable vision-based trajectory prediction approach. Rather than storing historical feature maps and trajectories, ViP3D encodes useful information from previous timestamps in the agent queries, which makes it a concise streaming prediction method. Extensive experiments on the nuScenes dataset show that ViP3D achieves strong vision-based prediction performance compared with traditional pipelines and previous end-to-end models.
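The streaming idea in the abstract, where per-agent queries carry information across timestamps so that no history of feature maps or trajectories is kept, can be illustrated with a toy sketch. This is not the ViP3D architecture: the `AgentQuery` class, the `update_query` and `decode_trajectory` functions, the smoothing update, and the constant-velocity decoder are all hypothetical stand-ins, and each frame is reduced to a dictionary of 2D positions in place of the image features a real model would attend to. The only point it shows is that the query is the sole state passed from one frame to the next.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AgentQuery:
    """Hypothetical per-agent state; the only thing kept between frames."""
    position: tuple[float, float]
    velocity: tuple[float, float] = (0.0, 0.0)


def update_query(query: AgentQuery, observed_xy: tuple[float, float],
                 dt: float, smoothing: float = 0.5) -> AgentQuery:
    """Fold one new frame into the query.

    Toy stand-in for a learned update: blends the newly observed
    velocity with the velocity already stored in the query.
    """
    vx = (observed_xy[0] - query.position[0]) / dt
    vy = (observed_xy[1] - query.position[1]) / dt
    new_velocity = (
        smoothing * vx + (1.0 - smoothing) * query.velocity[0],
        smoothing * vy + (1.0 - smoothing) * query.velocity[1],
    )
    return AgentQuery(position=observed_xy, velocity=new_velocity)


def decode_trajectory(query: AgentQuery, horizon: int,
                      dt: float) -> list[tuple[float, float]]:
    """Toy decoder: constant-velocity rollout from the query state."""
    return [
        (query.position[0] + query.velocity[0] * dt * k,
         query.position[1] + query.velocity[1] * dt * k)
        for k in range(1, horizon + 1)
    ]


def stream(frames: list[dict[str, tuple[float, float]]],
           dt: float = 0.5, horizon: int = 3):
    """Process frames one at a time; no past frames are stored."""
    queries: dict[str, AgentQuery] = {}
    outputs = []
    for frame in frames:
        next_queries = {}
        for agent_id, xy in frame.items():
            if agent_id in queries:
                # Existing agent: update its query in place of re-reading history.
                next_queries[agent_id] = update_query(queries[agent_id], xy, dt)
            else:
                # Newly appearing agent: start a fresh query.
                next_queries[agent_id] = AgentQuery(position=xy)
        queries = next_queries  # agents absent from this frame are dropped
        outputs.append({a: decode_trajectory(q, horizon, dt)
                        for a, q in queries.items()})
    return outputs
```

Calling `stream` on a list of frames returns one dictionary of predicted trajectories per frame, each computed from the current queries alone.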