Depth sensing is a crucial function of unmanned aerial vehicles and autonomous vehicles. Because monocular cameras are small and structurally simple, there has been growing interest in depth estimation from a single RGB image. However, state-of-the-art monocular CNN-based depth estimation methods rely on fairly complex deep neural networks and are too slow for real-time inference on embedded platforms. This paper addresses the problem of real-time depth estimation on embedded systems. We propose two efficient and lightweight encoder-decoder network architectures, RT-MonoDepth and RT-MonoDepth-S, to reduce computational complexity and latency. Our methods demonstrate that it is possible to achieve accuracy comparable to that of prior state-of-the-art depth estimation work at a faster inference speed. On a single RGB image of resolution 640$\times$192, RT-MonoDepth and RT-MonoDepth-S run at 18.4 and 30.5 FPS, respectively, on the NVIDIA Jetson Nano and at 253.0 and 364.1 FPS, respectively, on the NVIDIA Jetson AGX Orin, while achieving accuracy comparable to the state of the art on the KITTI dataset. To the best of the authors' knowledge, this paper achieves the best accuracy and the fastest inference speed among existing fast monocular depth estimation methods.
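To put the reported throughput in terms of per-frame latency, the following minimal sketch converts the FPS figures quoted above into milliseconds per frame. It assumes frames are processed one at a time, so that latency is simply the reciprocal of throughput; the helper names and the 30 FPS real-time target are illustrative assumptions, not values taken from the paper.

```python
# FPS figures copied from the abstract (single 640x192 RGB image).
REPORTED_FPS = {
    "Jetson Nano": {"RT-MonoDepth": 18.4, "RT-MonoDepth-S": 30.5},
    "Jetson AGX Orin": {"RT-MonoDepth": 253.0, "RT-MonoDepth-S": 364.1},
}


def latency_ms(fps: float) -> float:
    """Per-frame latency in milliseconds implied by a throughput in FPS.

    Assumes one frame is processed at a time (no batching or pipelining).
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def meets_target(fps: float, target_fps: float = 30.0) -> bool:
    """Check a throughput against a hypothetical real-time target (assumed 30 FPS)."""
    return fps >= target_fps


if __name__ == "__main__":
    for platform, models in REPORTED_FPS.items():
        for name, fps in models.items():
            print(
                f"{platform:16s} {name:15s} {fps:6.1f} FPS  "
                f"{latency_ms(fps):6.2f} ms/frame  "
                f"meets 30 FPS target: {meets_target(fps)}"
            )
```

Under these assumptions, the smaller RT-MonoDepth-S is the only one of the two that reaches the assumed 30 FPS target on the Jetson Nano, while both networks exceed it by a wide margin on the Jetson AGX Orin.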