Scene analysis is essential for enabling autonomous systems, such as mobile robots, to operate in real-world environments. However, obtaining a comprehensive understanding of a scene requires solving multiple tasks, such as panoptic segmentation, instance orientation estimation, and scene classification. Solving these tasks is challenging given the limited computing and battery capabilities of mobile platforms. To address this challenge, we introduce an efficient multi-task scene analysis approach, called EMSAFormer, that uses an RGB-D Transformer-based encoder to simultaneously perform the aforementioned tasks. Our approach builds upon the previously published EMSANet. However, we show that the dual CNN-based encoder of EMSANet can be replaced with a single Transformer-based encoder. To achieve this, we investigate how information from both RGB and depth data can be effectively incorporated into a single encoder. To accelerate inference on robotic hardware, we provide a custom NVIDIA TensorRT extension that enables highly optimized inference for our EMSAFormer approach. Through extensive experiments on the commonly used indoor datasets NYUv2, SUNRGB-D, and ScanNet, we show that our approach achieves state-of-the-art performance while still enabling inference at up to 39.1 FPS on an NVIDIA Jetson AGX Orin 32 GB.
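To make the idea of a single shared RGB-D encoder with multiple task heads concrete, the following is a minimal PyTorch sketch. It assumes a simple early-fusion strategy (stacking RGB and depth into a 4-channel input before patch embedding) and generic Transformer encoder layers; the actual EMSAFormer architecture, its RGB-D fusion, and its task decoders differ and are described in the paper. All class names and hyperparameters below are illustrative.

```python
import torch
import torch.nn as nn


class RGBDPatchEmbed(nn.Module):
    """Projects a 4-channel RGB-D input into a sequence of patch tokens (illustrative)."""

    def __init__(self, patch_size=16, embed_dim=256):
        super().__init__()
        # RGB (3 channels) + depth (1 channel) stacked into 4 input channels.
        self.proj = nn.Conv2d(4, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, rgb, depth):
        x = torch.cat([rgb, depth], dim=1)     # (B, 4, H, W)
        x = self.proj(x)                       # (B, C, H/p, W/p)
        return x.flatten(2).transpose(1, 2)    # (B, N, C) token sequence


class MultiTaskRGBDModel(nn.Module):
    """Single shared encoder feeding several lightweight task heads (not the EMSAFormer heads)."""

    def __init__(self, embed_dim=256, num_classes=40, num_scene_classes=10):
        super().__init__()
        self.patch_embed = RGBDPatchEmbed(embed_dim=embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Example task heads: per-token semantic logits and global scene-class logits.
        self.semantic_head = nn.Linear(embed_dim, num_classes)
        self.scene_head = nn.Linear(embed_dim, num_scene_classes)

    def forward(self, rgb, depth):
        tokens = self.encoder(self.patch_embed(rgb, depth))
        return {
            "semantic": self.semantic_head(tokens),        # per-patch logits
            "scene": self.scene_head(tokens.mean(dim=1)),  # pooled scene logits
        }


# Example forward pass with a 480x640 RGB-D frame.
model = MultiTaskRGBDModel()
out = model(torch.randn(1, 3, 480, 640), torch.randn(1, 1, 480, 640))
print(out["semantic"].shape, out["scene"].shape)
```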