Transformer-based approaches have driven recent progress in multi-camera 3D detection in both academia and industry. In a vanilla transformer architecture, queries are randomly initialised and optimised over the whole dataset, without accounting for differences among input frames. In this work, we propose to leverage the predictions of an image backbone, which is often highly optimised for 2D tasks, as priors for the transformer part of a 3D detection network. The method works by (1) augmenting image feature maps with 2D priors, (2) sampling query locations via ray-casting along 2D box centroids, and (3) initialising query features with object-level image features. Experimental results show that 2D priors not only help the model converge faster, but also substantially improve the baseline approach, by up to 12% in average precision.
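Step (2) of the method, sampling query locations by ray-casting along 2D box centroids, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cast_rays_from_boxes`, the pinhole intrinsics `K`, the `[x1, y1, x2, y2]` box format, and the fixed list of candidate depths are all assumptions made for illustration, and the resulting points are expressed in the camera frame rather than in any particular vehicle or world frame.

```python
import numpy as np


def cast_rays_from_boxes(boxes_2d, K, depths):
    """Back-project 2D box centroids into 3D points along camera rays.

    Hypothetical helper, assuming a pinhole camera without distortion.
    boxes_2d: (N, 4) array of [x1, y1, x2, y2] in pixels.
    K:        (3, 3) camera intrinsic matrix.
    depths:   (D,) candidate depths along the optical axis, in metres.
    Returns an (N, D, 3) array of points in the camera frame.
    """
    boxes_2d = np.asarray(boxes_2d, dtype=np.float64)
    K = np.asarray(K, dtype=np.float64)
    depths = np.asarray(depths, dtype=np.float64)

    # Centroid of each 2D box in pixel coordinates.
    u = 0.5 * (boxes_2d[:, 0] + boxes_2d[:, 2])
    v = 0.5 * (boxes_2d[:, 1] + boxes_2d[:, 3])
    pixels = np.stack([u, v, np.ones_like(u)], axis=1)  # (N, 3), homogeneous

    # Ray direction for each centroid, normalised so that z == 1.
    directions = pixels @ np.linalg.inv(K).T  # (N, 3)

    # Scale each ray by every candidate depth to get query locations.
    return directions[:, None, :] * depths[None, :, None]  # (N, D, 3)


# Usage with made-up intrinsics and boxes (assumed values, not from the paper).
K = np.array([[1000.0, 0.0, 800.0],
              [0.0, 1000.0, 450.0],
              [0.0, 0.0, 1.0]])
boxes = np.array([[700.0, 400.0, 900.0, 500.0],     # centred on the principal point
                  [1200.0, 400.0, 1400.0, 500.0]])  # offset to the right
queries = cast_rays_from_boxes(boxes, K, depths=[10.0, 20.0])
print(queries.shape)
```

Each detected 2D box yields one ray and therefore several candidate 3D query locations, one per sampled depth. How depths are chosen along the ray, and how the points are transformed into a shared multi-camera frame, is left open in this sketch.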