Indoor mobile robot navigation requires fast responsiveness and robust semantic understanding, yet existing methods struggle to provide both. Classical geometric approaches such as SLAM offer reliable localization but depend on detailed maps and cannot interpret human-targeted cues (e.g., signs, room numbers) essential for indoor reasoning. Vision-Language-Action (VLA) models introduce semantic grounding but remain strictly reactive, basing decisions only on visible frames and failing to anticipate unseen intersections or reason about distant textual cues. Vision-Language Models (VLMs) provide richer contextual inference but suffer from high computational latency, making them unsuitable for real-time operation on embedded platforms. In this work, we present IROS, a real-time navigation framework that combines VLM-level contextual reasoning with the efficiency of lightweight perceptual modules on low-cost, on-device hardware. Inspired by Dual Process Theory, IROS separates fast reflexive decisions (System One) from slow deliberative reasoning (System Two), invoking the VLM only when necessary. Furthermore, by augmenting compact VLMs with spatial and textual cues, IROS delivers robust, human-like navigation with minimal latency. Across five real-world buildings, IROS improves decision accuracy and reduces latency by 66% compared to continuous VLM-based navigation.
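The dual-process routing described above can be sketched as a simple dispatcher: a lightweight reflexive policy handles ordinary frames, and the latency-heavy VLM is queried only at semantically ambiguous moments such as intersections or visible textual cues. This is a minimal illustration under assumed names (`Observation`, `fast_policy`, `needs_deliberation`, `slow_vlm_policy`, `decide`); it is not the paper's actual API.

```python
from dataclasses import dataclass

# Hypothetical sketch of System One / System Two routing; all names and
# fields are assumptions for illustration, not IROS's real interfaces.

@dataclass
class Observation:
    corridor_clear: bool      # from a lightweight geometric check
    intersection_ahead: bool  # candidate branching point detected
    text_cue_detected: bool   # e.g., a sign or room number in view

def fast_policy(obs: Observation) -> str:
    """System One: cheap reflexive decision from lightweight perception."""
    return "forward" if obs.corridor_clear else "stop"

def needs_deliberation(obs: Observation) -> bool:
    """Gate: escalate to the slow path only when semantics matter."""
    return obs.intersection_ahead or obs.text_cue_detected

def slow_vlm_policy(obs: Observation) -> str:
    """System Two: stand-in for a latency-heavy VLM query.

    A real implementation would prompt a compact VLM augmented with
    spatial and textual cues; here a placeholder action is returned.
    """
    return "turn_left"

def decide(obs: Observation) -> str:
    """Invoke the VLM only when necessary; otherwise act reflexively."""
    if needs_deliberation(obs):
        return slow_vlm_policy(obs)
    return fast_policy(obs)
```

Because `needs_deliberation` is true only at intersections or when text cues appear, the expensive path is skipped on most frames, which is the source of the latency savings the abstract reports.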