Depth estimation is an important task in many robotics systems and applications. In mobile robotics, monocular depth estimation is desirable because a single RGB camera can be deployed at low cost and in a compact size. Owing to this significant and growing need, many lightweight monocular depth estimation networks have been proposed for mobile robotics systems. While most lightweight monocular depth estimation methods have been built on convolutional neural networks, the Transformer has recently been gradually adopted for monocular depth estimation. However, the large number of parameters and high computational cost of the Transformer hinder deployment on embedded devices. In this paper, we present the Token-Sharing Transformer (TST), a Transformer-based architecture for monocular depth estimation, optimized especially for embedded devices. The proposed TST utilizes global token sharing, which enables the model to obtain accurate depth predictions with high throughput on embedded devices. Experimental results show that TST outperforms existing lightweight monocular depth estimation methods. On the NYU Depth v2 dataset, TST delivers depth maps at up to 63.4 FPS on the NVIDIA Jetson Nano and 142.6 FPS on the NVIDIA Jetson TX2, with lower errors than existing methods. Furthermore, TST achieves real-time depth estimation of high-resolution images on the Jetson TX2 with competitive results.