Language-driven object navigation requires agents to interpret natural language descriptions of target objects, which combine intrinsic and extrinsic attributes for instance recognition and commonsense navigation. Existing methods either (i) use end-to-end trained models with vision-language embeddings, which struggle to generalize beyond the training data and lack action-level explainability, or (ii) rely on modular zero-shot pipelines built from large language models (LLMs) and open-set object detectors, which suffer from error propagation, high computational cost, and difficulty integrating their reasoning back into the navigation policy. To this end, we propose a compact 3B-parameter Vision-Language-Action (VLA) agent that performs human-like embodied reasoning for both object recognition and action selection, removing the need for stitched multi-model pipelines. Instead of raw embedding matching, our agent employs explicit image-grounded reasoning to directly answer "Is this the target object?" and "Why should I take this action?" The reasoning process unfolds in three stages: "think", "think summary", and "action", yielding improved explainability, stronger generalization, and more efficient navigation. Code and dataset will be released upon acceptance.