In recent years, researchers have shown growing interest in equipping Multimodal Large Language Models (MLLMs) with spatial understanding and reasoning capabilities. However, most existing methods overlook the ability to operate continuously in an ever-changing world, and cannot be deployed on embodied systems in real-world environments. In this work, we introduce OnlineSI, a framework that continuously improves its spatial understanding of its surroundings from a video stream. Our core idea is to maintain a finite spatial memory that retains past observations, ensuring that the computation required for each inference does not grow as input accumulates. We further integrate 3D point cloud information with semantic information, helping the MLLM better locate and identify objects in the scene. To evaluate our method, we introduce the Fuzzy $F_1$-Score to mitigate ambiguity, and test our approach on two representative datasets. Experiments demonstrate the effectiveness of our method, paving the way towards real-world embodied systems.