Deep learning and large-scale language-image training have produced image object detectors that generalise well to diverse environments and semantic classes. However, single-image object detectors trained on internet data are not optimally tailored to the embodied conditions inherent in robotics. Instead, robots must detect objects from complex multi-modal data streams involving depth, localisation and temporal correlation, a task termed embodied object detection. Paradigms such as Video Object Detection (VOD) and Semantic Mapping have been proposed to leverage such embodied data streams, but existing work in these paradigms does not use language-image training to improve performance. In response, we investigate how an image object detector pre-trained on language-image data can be extended to perform embodied object detection. We propose a novel implicit object memory that uses projective geometry to aggregate the features of detected objects across long temporal horizons. The spatial and temporal information accumulated in memory is then used to enhance the image features of the base detector. When tested on embodied data streams sampled from diverse indoor scenes, our approach improves the base object detector by 3.09 mAP, outperforming alternative external memories designed for VOD and Semantic Mapping. Our method also improves by 16.90 mAP on baselines that perform embodied object detection without first training on language-image data, and is robust to the sensor noise and domain shift experienced in real-world deployment.
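The abstract does not specify how the implicit object memory is implemented, so the following is only a minimal sketch of the general idea of aggregating detected-object features across frames with projective geometry. It assumes a pinhole camera, a known camera-to-world pose per frame, a top-down grid as the memory layout, and a running mean as the aggregation rule. The names `backproject` and `GridMemory` are hypothetical and are not taken from the paper.

```python
import numpy as np


def backproject(depth, K, cam_to_world):
    """Lift every pixel to a 3D world point (pinhole model, assumed)."""
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)
    pts_cam = pts_cam.reshape(-1, 4)
    # Row-vector form of applying the 4x4 pose to each homogeneous point.
    return (pts_cam @ cam_to_world.T)[:, :3]


class GridMemory:
    """Hypothetical top-down memory storing a running mean of features."""

    def __init__(self, size, cell, channels):
        self.size = size          # grid is size x size cells
        self.cell = cell          # cell edge length in world units
        self.feat = np.zeros((size, size, channels))
        self.count = np.zeros((size, size))

    def _cells(self, pts):
        # Map world (x, z) to grid indices with the origin at the centre.
        ij = np.floor(pts[:, [0, 2]] / self.cell).astype(int) + self.size // 2
        valid = np.all((ij >= 0) & (ij < self.size), axis=1)
        return ij, valid

    def write(self, pts, feats, mask):
        """Accumulate features of pixels flagged as detected objects."""
        ij, valid = self._cells(pts)
        keep = valid & mask
        i, j = ij[keep, 0], ij[keep, 1]
        # np.add.at handles several pixels landing in the same cell.
        np.add.at(self.feat, (i, j), feats[keep])
        np.add.at(self.count, (i, j), 1)

    def read(self, pts):
        """Return the mean stored feature at each point (zeros if empty)."""
        ij, valid = self._cells(pts)
        out = np.zeros((pts.shape[0], self.feat.shape[-1]))
        i, j = ij[valid, 0], ij[valid, 1]
        n = np.maximum(self.count[i, j], 1)[:, None]
        out[valid] = self.feat[i, j] / n
        return out
```

In a sketch like this, the features returned by `read` for the current frame's pixels could be combined with the base detector's image features, for example by concatenation. That fusion step is likewise an assumption rather than the method described in the abstract.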