In domestic environments, robots require a comprehensive understanding of their surroundings to interact effectively and intuitively with untrained humans. In this paper, we propose DVEFormer, an efficient RGB-D Transformer-based approach that predicts dense text-aligned visual embeddings (DVE) via knowledge distillation. Instead of directly performing classical semantic segmentation over a fixed set of predefined classes, our method uses teacher embeddings from Alpha-CLIP to guide our efficient student model DVEFormer in learning fine-grained pixel-wise embeddings. While this approach still enables classical semantic segmentation, e.g., via linear probing, it additionally supports flexible text-based querying and other applications, such as building comprehensive 3D maps. Evaluations on common indoor datasets demonstrate that our approach achieves competitive performance while meeting real-time requirements, operating at 26.3 FPS for the full model and 77.0 FPS for a smaller variant on an NVIDIA Jetson AGX Orin. Additionally, qualitative results highlight the effectiveness of our approach and possible use cases in real-world applications. Overall, our method serves as a drop-in replacement for traditional segmentation approaches while enabling flexible natural-language querying and seamless integration into 3D mapping pipelines for mobile robotics.
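To illustrate the kind of text-based querying that dense text-aligned embeddings make possible, the sketch below shows per-pixel cosine similarity between a dense embedding map and a text embedding. It is a minimal illustration, not the paper's implementation: the 512-dimensional embedding space, the map resolution, and the random tensors standing in for the DVEFormer output and a CLIP text embedding are all assumptions for demonstration purposes.

```python
import torch
import torch.nn.functional as F

# Assumed shapes: the student predicts a (D, H, W) embedding map aligned with
# the CLIP text embedding space. D = 512 is an assumption; the actual
# dimensionality depends on the chosen CLIP variant.
D, H, W = 512, 120, 160
dense_embeddings = torch.randn(D, H, W)  # stand-in for the DVEFormer output
text_embedding = torch.randn(D)          # stand-in for a CLIP text embedding of a query, e.g. "a chair"

# Normalize both so that the dot product equals cosine similarity.
dense = F.normalize(dense_embeddings.reshape(D, -1), dim=0)  # (D, H*W), one unit vector per pixel
text = F.normalize(text_embedding, dim=0)                    # (D,)

# Per-pixel similarity to the text query; reshaped back to the image grid it
# forms a heatmap that can be thresholded into a query-specific mask.
similarity = (text @ dense).reshape(H, W)
mask = similarity > similarity.mean() + similarity.std()     # illustrative threshold
print(similarity.shape, mask.float().mean().item())
```

In the same spirit, replacing the single text embedding with a matrix of class-name embeddings (or a learned linear layer) and taking the per-pixel argmax recovers classical semantic segmentation, which is what linear probing over the dense embeddings corresponds to.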