Vision-centric joint perception and prediction (PnP) has become an emerging trend in autonomous driving research. It predicts the future states of the traffic participants in the surrounding environment from raw RGB images. However, two challenges remain critical: synchronizing features obtained from multiple camera views and timestamps, which is difficult because of inevitable geometric distortions, and then exploiting those spatial-temporal features. To address this issue, we propose a temporal bird's-eye-view (BEV) pyramid transformer (TBP-Former) for vision-centric PnP, which includes two novel designs. First, a pose-synchronized BEV encoder maps raw image inputs with any camera pose at any time to a shared and synchronized BEV space for better spatial-temporal synchronization. Second, a spatial-temporal pyramid transformer comprehensively extracts multi-scale BEV features and predicts future BEV states with the support of spatial-temporal priors. Extensive experiments on the nuScenes dataset show that the proposed framework overall outperforms all state-of-the-art vision-based prediction methods.
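The abstract gives no implementation details for the pose-synchronized BEV encoder, so the following is only a minimal geometric sketch of the underlying idea, not the authors' method. It assumes a pinhole camera, known ego and camera poses, and a flat BEV grid. Points of one shared BEV grid, defined in a reference ego frame, are projected into an image taken at any pose and time by a single rigid transform followed by the camera intrinsics. All names (`make_bev_grid`, `project_bev_to_image`, `T_cam_from_ref`) are hypothetical.

```python
import numpy as np


def make_bev_grid(x_range, y_range, step, z=0.0):
    """Return (N, 4) homogeneous points of a flat BEV grid in the reference frame."""
    xs = np.arange(x_range[0], x_range[1], step)
    ys = np.arange(y_range[0], y_range[1], step)
    gx, gy = np.meshgrid(xs, ys, indexing="ij")
    n = gx.size
    return np.stack([gx.ravel(), gy.ravel(), np.full(n, z), np.ones(n)], axis=1)


def project_bev_to_image(points_ref, T_cam_from_ref, K, image_size):
    """Project reference-frame points into one camera image.

    points_ref:     (N, 4) homogeneous points in the shared reference frame.
    T_cam_from_ref: (4, 4) rigid transform from the reference frame to the
                    camera frame (assumed to already combine the ego motion
                    between timestamps with the camera extrinsics).
    K:              (3, 3) pinhole intrinsics with last row [0, 0, 1].
    image_size:     (width, height) in pixels.

    Returns (uv, valid): pixel coordinates (N, 2) and a boolean mask of
    points that lie in front of the camera and inside the image.
    """
    pts_cam = (T_cam_from_ref @ points_ref.T).T[:, :3]
    depth = pts_cam[:, 2]
    in_front = depth > 1e-6
    # Avoid dividing by zero or negative depth; such points are masked out.
    safe_depth = np.where(in_front, depth, 1.0)
    uvw = (K @ pts_cam.T).T
    uv = uvw[:, :2] / safe_depth[:, None]
    width, height = image_size
    inside = (
        (uv[:, 0] >= 0) & (uv[:, 0] < width)
        & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    )
    return uv, in_front & inside
```

Because every camera and timestamp is sampled through the same reference-frame grid, image features gathered at the returned pixel locations land in one common BEV layout. In a learned encoder these locations would typically serve as sampling or attention reference points rather than as a hard lookup.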