On-board chips vary widely in computing power, so deploying the same learning-based algorithm on different chips currently requires a separate training run for each, which consumes substantial computing resources. The cost is even higher for 3D perception methods with large models. Previous vision-centric 3D perception approaches are trained on regular-grid feature maps of a fixed resolution and cannot adapt to other grid scales, which limits wider deployment. In this paper, we use the Polar representation when constructing the bird's-eye-view (BEV) feature map from images, so that a model can be trained once and deployed in multiple settings. Specifically, features along rays in Polar space can be adaptively sampled and projected to Cartesian-space features at arbitrary resolutions. To further improve adaptation capability, we let multi-scale contextual information interact to enhance the feature representation. Experiments on a large-scale autonomous driving dataset show that our method outperforms other approaches in the train-once, deploy-many setting.
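To illustrate how a Polar feature map might be resampled onto Cartesian grids of different resolutions, the following is a minimal NumPy sketch. It is not the method's actual implementation: the function name `polar_to_cartesian`, the `(C, A, R)` array layout, the bin-center conventions, and the use of plain bilinear interpolation are all assumptions made for illustration.

```python
import numpy as np

def polar_to_cartesian(polar_feat, r_max, out_size, extent):
    """Resample a polar BEV feature map onto a square Cartesian grid.

    Hypothetical conventions (assumptions, not taken from the paper):
      polar_feat: array of shape (C, A, R) with A azimuth bins covering
                  [-pi, pi) and R >= 2 radial bins covering [0, r_max].
      out_size:   number of Cartesian cells per side (the target resolution).
      extent:     half-width of the Cartesian grid, in the same unit as r_max.
    """
    C, A, R = polar_feat.shape

    # Centers of the Cartesian cells.
    coords = (np.arange(out_size) + 0.5) / out_size * 2.0 * extent - extent
    x, y = np.meshgrid(coords, coords, indexing="xy")

    # Polar coordinates of every Cartesian cell center.
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)

    # Continuous bin indices; bin k is centered at index k.
    a_f = (theta + np.pi) / (2.0 * np.pi) * A - 0.5
    r_f = np.clip(r / r_max * R - 0.5, 0.0, R - 1.0)

    # Azimuth neighbors wrap around; radial neighbors are clamped.
    a0 = np.floor(a_f).astype(int)
    wa = a_f - a0
    a1 = (a0 + 1) % A
    a0 = a0 % A
    r0 = np.clip(np.floor(r_f).astype(int), 0, R - 2)
    wr = r_f - r0
    r1 = r0 + 1

    # Bilinear interpolation between the four neighboring polar bins.
    out = ((1 - wa) * (1 - wr) * polar_feat[:, a0, r0]
           + (1 - wa) * wr * polar_feat[:, a0, r1]
           + wa * (1 - wr) * polar_feat[:, a1, r0]
           + wa * wr * polar_feat[:, a1, r1])

    # Zero out cells that fall outside the polar map's range.
    return out * (r <= r_max)

# The same polar features can be projected to several grid resolutions.
polar = np.random.rand(16, 64, 32)
coarse = polar_to_cartesian(polar, r_max=50.0, out_size=64, extent=30.0)
fine = polar_to_cartesian(polar, r_max=50.0, out_size=256, extent=30.0)
```

In this sketch the target resolution enters only through `out_size`, which reflects the idea that the sampling step, rather than the trained features, is what changes between deployments.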