Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why; deployment thus becomes a sequence of independent trials in which mistakes repeat rather than accumulate into experience. Drawing on the practice of human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: \textit{reflection-in-action}, in which the agent uses test-time scaling to generate and score multiple candidate actions with internal reflections before execution; and \textit{reflection-on-action}, in which the agent uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We further include retrospective reflection, allowing the agent to re-evaluate earlier decisions and update its models in hindsight for proper long-horizon credit assignment. Experiments on our newly designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, and ablation studies validate the complementary roles of reflection-in-action and reflection-on-action. Qualitative analyses, including real-robot trials, highlight behavioral correction through reflection.
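The interplay between the two reflection modes can be sketched as a minimal loop: sample candidate actions, score them with an internal reflection signal before acting, then fold external feedback back into the agent's state after acting. This is a toy illustration, not the paper's method; every name here (`propose_actions`, `internal_reflection_score`, the failure-memory stand-in for test-time training) is hypothetical.

```python
import random

def propose_actions(state, k=4):
    # reflection-in-action, step 1: test-time scaling samples k candidates
    # (a real agent would sample these from an LLM policy)
    return [f"action_{i}_for_{state}" for i in range(k)]

def internal_reflection_score(state, action, memory):
    # reflection-in-action, step 2: score each candidate before execution;
    # here a toy score that penalizes actions remembered as past failures
    return -1.0 if (state, action) in memory else random.random()

def reflect_in_action(state, memory):
    # pick the candidate the internal reflection rates highest
    candidates = propose_actions(state)
    return max(candidates,
               key=lambda a: internal_reflection_score(state, a, memory))

def reflect_on_action(state, action, succeeded, memory):
    # reflection-on-action: after execution, external feedback updates the
    # agent (a failure memory stands in for a test-time training update)
    if not succeeded:
        memory.add((state, action))

memory = set()
state = "cupboard_open"
action = reflect_in_action(state, memory)
reflect_on_action(state, action, succeeded=False, memory=memory)
assert (state, action) in memory  # the failure now informs future scoring
```

In the actual system the failure memory would be replaced by gradient updates to the reflection model and action policy, and retrospective reflection would additionally revisit earlier `(state, action)` pairs in the episode when assigning credit.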