EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks

Large foundation models have made significant advances in embodied intelligence, enabling synthesis and reasoning over egocentric input for household tasks. However, auto-labeling based on vision-language models (VLMs) is often noisy because the primary data sources lack accurate human action labels, chain-of-thought (CoT) annotations, and spatial annotations, and these errors are amplified during long-horizon spatial instruction following. The underlying causes are insufficient coverage of minute-long, daily household planning tasks and inaccurate spatial grounding. As a result, VLM reasoning chains and world-model synthesis can hallucinate objects, skip steps, or violate real-world physical attributes. To address these gaps, we introduce EgoTL, a think-aloud capture pipeline for egocentric data. It uses a say-before-act protocol to record step-by-step goals and spoken reasoning with word-level timestamps, then calibrates physical properties with metric-scale spatial estimators and adds a memory-bank walkthrough for scene context and clip-level tags for navigation instructions and detailed manipulation actions. With EgoTL, we benchmark VLMs and world models on six task dimensions across three layers, and on long-horizon generation over minute-long sequences, covering more than 100 daily household tasks. We find that foundation models still fall short as egocentric assistants or open-world simulators. Finally, we finetune foundation models on the EgoTL training split using human CoT aligned with metric labels, which improves long-horizon planning and reasoning, step-wise reasoning, instruction following, and spatial grounding.

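The say-before-act protocol described in the abstract can be illustrated with a minimal sketch. It is not the EgoTL implementation: the `Word` and `Clip` types, the `max_gap` silence threshold, and the pairing rule (each utterance is matched to the first clip that starts after the utterance ends) are all assumptions made here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    """One transcribed word with word-level timestamps (hypothetical schema)."""
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Clip:
    """One action clip with a clip-level tag (hypothetical schema)."""
    tag: str      # e.g. "navigation" or "manipulation"
    start: float  # seconds
    end: float    # seconds


def group_utterances(words: List[Word], max_gap: float = 1.0) -> List[List[Word]]:
    """Split the word stream into utterances wherever silence exceeds max_gap."""
    utterances: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        if current and word.start - current[-1].end > max_gap:
            utterances.append(current)
            current = []
        current.append(word)
    if current:
        utterances.append(current)
    return utterances


def pair_say_before_act(
    words: List[Word], clips: List[Clip], max_gap: float = 1.0
) -> List[Tuple[str, Optional[Clip]]]:
    """Pair each spoken utterance with the first clip that starts after it ends.

    Assumed rule: under say-before-act, the speaker finishes stating the step
    before performing it, so the matching action is the next clip in time.
    An utterance with no later clip is paired with None.
    """
    ordered = sorted(clips, key=lambda c: c.start)
    pairs: List[Tuple[str, Optional[Clip]]] = []
    for utterance in group_utterances(words, max_gap):
        utterance_end = utterance[-1].end
        following = next((c for c in ordered if c.start >= utterance_end), None)
        pairs.append((" ".join(w.text for w in utterance), following))
    return pairs
```

Under these assumptions, a transcript containing "open fridge" followed several seconds later by "grab milk" would be split into two utterances, each paired with the clip that begins after it is spoken. A real pipeline would likely need to handle overlapping speech and action, which this sketch ignores.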