Modern vision-centric approaches to environment perception for autonomous navigation rely heavily on self-supervised monocular depth estimation algorithms that output disparity maps. However, when a disparity map is projected into 3D space, its errors are magnified: the resulting depth error grows quadratically with distance from the camera. Although Light Detection and Ranging (LiDAR) can solve this issue, it is expensive and not feasible for many applications. To address the challenge of accurate ranging with low-cost sensors, we propose OCTraN, a transformer architecture that uses iterative attention to convert 2D image features into 3D occupancy features and uses convolution and transpose convolution to operate efficiently on spatial information. We also develop a self-supervised training pipeline that generalizes the model to any scene by replacing LiDAR ground truth with pseudo-ground-truth labels obtained from boosted monocular depth estimation.
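The quadratic growth of depth error can be illustrated with a minimal sketch. It assumes the standard pinhole relation Z = f * b / d between depth Z and disparity d, which the abstract does not state explicitly. The focal length, baseline, and disparity error below are hypothetical values chosen only for illustration, and the function names are not taken from OCTraN.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres under the assumed relation Z = f * b / d."""
    return focal_px * baseline_m / disparity_px


def depth_error(depth_m, disparity_error_px, focal_px, baseline_m):
    """First-order depth error: |dZ| ~= Z**2 / (f * b) * |dd|.

    Obtained by differentiating Z = f * b / d with respect to d
    and substituting d = f * b / Z.
    """
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px


# Hypothetical camera parameters (assumptions, not values from the paper).
FOCAL_PX = 720.0     # focal length in pixels
BASELINE_M = 0.54    # baseline in metres
DISP_ERR_PX = 0.5    # constant disparity error in pixels

# Depth error at a few distances for the same disparity error.
errors = {
    z: depth_error(z, DISP_ERR_PX, FOCAL_PX, BASELINE_M)
    for z in (10.0, 20.0, 40.0)
}

for z, err in errors.items():
    print(f"depth {z:5.1f} m -> approx. error {err:.2f} m")
```

Under these assumptions, each doubling of distance quadruples the depth error for a fixed disparity error, which is the effect the abstract describes.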