Fast-BEV: A Fast and Strong Bird's-Eye View Perception Baseline

Recently, perception tasks based on the Bird's-Eye View (BEV) representation have drawn increasing attention, and the BEV representation is a promising foundation for next-generation Autonomous Vehicle (AV) perception. However, most existing BEV solutions either require considerable resources for on-vehicle inference or deliver only modest performance. This paper proposes a simple yet effective framework, termed Fast-BEV, which is capable of performing faster BEV perception on on-vehicle chips. Towards this goal, we first empirically find that the BEV representation can be sufficiently powerful without an expensive transformer-based transformation or a depth representation. Fast-BEV consists of five parts. We propose (1) a lightweight, deployment-friendly view transformation that quickly transfers 2D image features to 3D voxel space, (2) a multi-scale image encoder that leverages multi-scale information for better performance, and (3) an efficient BEV encoder designed specifically to speed up on-vehicle inference. We further introduce (4) a strong data augmentation strategy for both image and BEV space to avoid over-fitting, and (5) a multi-frame feature fusion mechanism to leverage temporal information. In experiments on the 2080Ti platform, our R50 model runs at 52.6 FPS with 47.3% NDS on the nuScenes validation set, compared with 41.3 FPS and 47.5% NDS for the BEVDepth-R50 model and 30.2 FPS and 45.7% NDS for the BEVDet4D-R50 model. Our largest model (R101@900x1600) establishes a competitive 53.5% NDS on the nuScenes validation set. We further develop a benchmark with considerable accuracy and efficiency on currently popular on-vehicle chips. The code is released at: https://github.com/Sense-GVT/Fast-BEV.
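
The abstract does not spell out how the view transformation works. As a rough illustration of the general idea of moving 2D image features into a 3D voxel grid without depth estimation or attention, the sketch below projects voxel centers into a feature map with a pinhole camera model and copies the feature found at each projected pixel. The function names (`build_lookup`, `lift_features`), the nearest-pixel sampling, and the single-camera setup are assumptions made for illustration; this is not the Fast-BEV implementation.

```python
import numpy as np


def build_lookup(voxel_centers, K, T_cam_from_ego, feat_hw):
    """Precompute the feature-map pixel that each voxel center projects to.

    voxel_centers: (N, 3) points in the ego frame.
    K: (3, 3) pinhole intrinsics, already scaled to the feature-map resolution.
    T_cam_from_ego: (4, 4) rigid transform from the ego frame to the camera frame.
    feat_hw: (H, W) size of the feature map.
    Returns idx, an (N, 2) int array of (row, col), and valid, an (N,) bool mask.
    """
    h, w = feat_hw
    n = voxel_centers.shape[0]
    homo = np.concatenate([voxel_centers, np.ones((n, 1))], axis=1)  # (N, 4)
    cam = (T_cam_from_ego @ homo.T).T[:, :3]                         # (N, 3)

    z = cam[:, 2]
    in_front = z > 1e-6                   # points behind the camera are invalid
    safe_z = np.where(in_front, z, 1.0)   # avoid dividing by zero or negatives

    uvw = (K @ cam.T).T                   # (N, 3) homogeneous pixel coordinates
    col = np.floor(uvw[:, 0] / safe_z).astype(np.int64)
    row = np.floor(uvw[:, 1] / safe_z).astype(np.int64)

    valid = in_front & (col >= 0) & (col < w) & (row >= 0) & (row < h)
    return np.stack([row, col], axis=1), valid


def lift_features(feat, idx, valid):
    """Copy image features into voxels using a precomputed lookup.

    feat: (C, H, W) image feature map.
    Returns an (N, C) array; voxels with no valid projection stay zero.
    """
    out = np.zeros((idx.shape[0], feat.shape[0]), dtype=feat.dtype)
    rows, cols = idx[valid, 0], idx[valid, 1]
    out[valid] = feat[:, rows, cols].T    # (C, M) gathered, transposed to (M, C)
    return out
```

Because the lookup in this sketch depends only on the camera calibration and the voxel grid, it can be computed once and reused for every frame, which is one plausible reason a projection-based lift is cheap at inference time.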
