RoboPIN: Grounded Embodied Reasoning via Pinned Chain-of-Thought

Embodied reasoning requires models to perceive task-relevant objects and spaces in physical environments and to maintain consistent visual grounding throughout multi-step reasoning. However, current vision-language models rely on text-only or coordinate-augmented chain-of-thought, in which entity references remain implicit and ambiguous. This can decouple the reasoning process from visual evidence, let entity references drift across steps, and break the causal link between the reasoning trajectory and the final answer. These problems are amplified in multi-view scenarios, where an entity's appearance changes across views. To address these issues, we propose Pinned Chain-of-Thought (\pincot{}), a structured reasoning paradigm that pins every reasoning step to visual evidence. \pincot{} introduces the concept of \reasoninganchor{}, which binds each task-relevant entity to a structured visual anchor with entity name, unique identity, view index, and spatial grounding, enabling consistent entity tracking across reasoning steps and views. We build a fully automated data generation pipeline to construct \dataset{}, a high-quality \pincot{}-formatted reasoning dataset. We then train \method{} with a three-stage post-training procedure that progressively injects embodied knowledge, structured reasoning ability, and process-supervised alignment, using rewards that directly constrain both anchor localization and identity consistency during reasoning. On 14 benchmarks covering embodied spatial reasoning, multi-view reasoning, and pointing, \method{}, with only 4B parameters, consistently outperforms 7B-level open-source embodied models, achieving a 12\% average improvement over the strongest 7B baseline, Mimo-Embodied. Further analysis shows that \pincot{} improves grounding accuracy and cross-step identity consistency, validating the effectiveness of process supervision.
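
As a rough illustration of the anchor structure described above, the following minimal Python sketch models an anchor with the four fields the abstract names and checks that an identity never switches entity name across reasoning steps. The class name `ReasoningAnchor`, the function `identity_consistent`, the field names, and the use of a normalized 2D point for spatial grounding are all assumptions made for illustration; the abstract does not specify the actual anchor format or how the identity-consistency reward is computed.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ReasoningAnchor:
    """Hypothetical anchor holding the four fields named in the abstract."""
    name: str                    # entity name, e.g. "mug"
    identity: int                # unique identity shared across steps and views
    view: int                    # index of the view this anchor refers to
    point: Tuple[float, float]   # spatial grounding; a 2D point is an assumption


def identity_consistent(steps: List[List[ReasoningAnchor]]) -> bool:
    """Return True if every identity maps to a single entity name in all steps."""
    seen: Dict[int, str] = {}
    for anchors in steps:
        for anchor in anchors:
            # setdefault records the first name seen for this identity
            if seen.setdefault(anchor.identity, anchor.name) != anchor.name:
                return False
    return True


# Two reasoning steps over two views: identity 1 stays "mug" in both views.
steps = [
    [ReasoningAnchor("mug", 1, 0, (0.42, 0.61))],
    [ReasoningAnchor("mug", 1, 1, (0.18, 0.55)),
     ReasoningAnchor("plate", 2, 1, (0.70, 0.64))],
]

# A third step that reuses identity 1 for a different entity (reference drift).
drifted = steps + [[ReasoningAnchor("bowl", 1, 0, (0.40, 0.60))]]
```

Here `identity_consistent(steps)` holds, while `identity_consistent(drifted)` does not, because identity 1 is reused for a different entity. A check of this kind is only one plausible reading of "identity consistency"; the paper's reward may be defined differently.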

