Multi-camera Bird's Eye View Perception for Autonomous Driving

Most automated driving systems comprise a diverse sensor set, including several cameras, radars, and LiDARs, which together provide complete 360° coverage in both near and far regions. Unlike radar and LiDAR, which measure directly in 3D, cameras capture a 2D perspective projection with inherent depth ambiguity. However, perception outputs must be produced in 3D to enable spatial reasoning about other agents and structures for path planning. The 3D space is typically simplified to the bird's eye view (BEV) space by omitting the less relevant Z-coordinate, which corresponds to the height dimension. The most basic approach to obtaining a BEV representation from a camera image is inverse perspective mapping (IPM), which assumes a flat ground surface. Surround-view systems, which are common in new vehicles, use the IPM principle to generate a BEV image and show it to the driver on a display. However, this approach is not suited to autonomous driving because the transformation is too simplistic: anything that violates the flat-ground assumption, such as objects with height, is severely distorted. More recent approaches use deep neural networks (DNNs) to produce outputs directly in BEV space. These methods transform camera images into BEV space using geometric constraints applied implicitly or explicitly in the network. Because a CNN can draw on more context information, and a learnable transformation is more flexible and can adapt to image content, deep learning-based methods set the new benchmark for BEV transformation and achieve state-of-the-art performance. First, this chapter discusses contemporary trends in multi-camera DNN models that output object representations directly in BEV space. Then, we discuss how this approach can extend to effective sensor fusion and to coupling with downstream tasks such as situation analysis and prediction. Finally, we present challenges and open problems in BEV perception.
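To make the flat-ground assumption behind IPM concrete, the following is a minimal sketch rather than the method of any particular system. It assumes an ideal pinhole camera with no lens distortion, a world frame with X forward, Y left, and Z up, and a camera mounted 1.5 m above the ground looking straight ahead. The intrinsics `K`, the pose `R` and `t`, and all function names are hypothetical values chosen for illustration.

```python
import numpy as np

def ground_homography(K, R, t):
    """Homography taking ground-plane points (X, Y, 1), with Z = 0, to pixels."""
    # For Z = 0 the projection K [R | t] [X, Y, 0, 1]^T reduces to K [r1 r2 t] [X, Y, 1]^T.
    return K @ np.column_stack((R[:, 0], R[:, 1], t))

def project(K, R, t, point_world):
    """Pinhole projection of a 3D world point to pixel coordinates (u, v)."""
    p = K @ (R @ point_world + t)
    return p[:2] / p[2]

def pixel_to_ground(H, u, v):
    """IPM: map pixel (u, v) to ground coordinates (X, Y), assuming the pixel shows ground."""
    p = np.linalg.solve(H, np.array([u, v, 1.0]))
    return p[:2] / p[2]

# Assumed intrinsics: 800 px focal length, principal point at (640, 360).
K = np.array([[800.0, 0.0, 640.0],
              [0.0, 800.0, 360.0],
              [0.0, 0.0, 1.0]])

# World-to-camera rotation: camera x = -Y (right), camera y = -Z (down), camera z = X (forward).
R = np.array([[0.0, -1.0, 0.0],
              [0.0, 0.0, -1.0],
              [1.0, 0.0, 0.0]])
camera_centre = np.array([0.0, 0.0, 1.5])  # assumed mounting height of 1.5 m
t = -R @ camera_centre

H = ground_homography(K, R, t)

# A point on the road surface is recovered exactly by IPM.
flat = pixel_to_ground(H, *project(K, R, t, np.array([10.0, 2.0, 0.0])))

# A point 1 m above the road violates the flat-ground assumption and is misplaced.
raised = pixel_to_ground(H, *project(K, R, t, np.array([10.0, 2.0, 1.0])))

print("ground point recovered at:", flat)
print("raised point recovered at:", raised)
```

Under these assumptions the road-surface point is recovered at (10, 2), while the point 1 m above the road is placed at roughly (30, 6), three times too far away. This is the kind of distortion that makes plain IPM unsuitable for perceiving objects with height.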
