Solving complex real-world control tasks often takes multiple tries: if we fail at first, we reflect on what went wrong and change our strategy accordingly to avoid making the same mistake. In robotics, Vision-Language-Action models (VLAs) offer a promising path towards solving complex control tasks, but they lack the ability to contextually and dynamically readjust their behavior when they fail to accomplish a task. In this work, we introduce Learning from Inference-Time Execution (LITEN), which connects a low-level VLA policy to a high-level VLM that conditions on past experiences by including them in-context, allowing it to learn the affordances and capabilities of the low-level VLA. Our approach iterates between a reasoning phase, which generates and executes plans for the low-level VLA, and an assessment phase, which reflects on the resulting execution and draws conclusions to be included in future reasoning contexts. Unlike similar approaches to self-refinement in non-robotics domains, LITEN must reflect on unstructured real-world robot trajectories (e.g., raw videos), which requires structured guardrails during assessment. Our experimental results demonstrate that LITEN effectively learns from past experience to generate plans that use high-affordance instructions to accomplish long-horizon tasks.