Pre-trained Vision-Language-Action (VLA) models represent a major leap towards general-purpose robots, yet efficiently adapting them to novel, specific tasks in situ remains a significant hurdle. While reinforcement learning (RL) is a promising avenue for such adaptation, the process often suffers from low sample efficiency, hindering rapid task mastery. We introduce Reflective Self-Adaptation, a framework for rapid, autonomous task adaptation without human intervention. Our framework establishes a self-improving loop in which the agent learns from its own experience to enhance both strategy and execution. At its core is a dual-pathway architecture that addresses the full adaptation lifecycle. First, a Failure-Driven Reflective RL pathway enables rapid learning by using the VLM's causal reasoning to automatically synthesize a targeted, dense reward function from failure analysis. This provides a focused learning signal that significantly accelerates policy exploration. However, optimizing such proxy rewards introduces the risk of "reward hacking," where the agent masters the reward function but fails the actual task. To counteract this, our second pathway, Success-Driven Quality-Guided SFT, grounds the policy in holistic success: it identifies and selectively imitates high-quality successful trajectories, ensuring the agent remains aligned with the ultimate task goal. This pathway is strengthened by a conditional curriculum mechanism that aids initial exploration. We conduct experiments on challenging manipulation tasks. The results demonstrate that our framework achieves faster convergence and higher final success rates than representative baselines. Our work presents a robust solution for creating self-improving agents that can efficiently and reliably adapt to new environments.
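The dual-pathway loop described above can be sketched schematically. This is a minimal illustrative sketch, not the paper's implementation: all names (`Trajectory`, `Policy`, `synthesize_reward`, the quality threshold) are hypothetical stand-ins, and the VLM's failure analysis is mocked by a trivial shaping function.

```python
# Hedged sketch of the Reflective Self-Adaptation loop.
# Assumptions: trajectory quality is a scalar in [0, 1], and the VLM's
# causal failure analysis is replaced by a toy dense-reward synthesizer.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    observations: List[float]
    success: bool
    quality: float  # hypothetical VLM-scored quality of a successful rollout


@dataclass
class Policy:
    sft_buffer: List[Trajectory] = field(default_factory=list)
    rl_updates: int = 0

    def rl_update(self, traj: Trajectory, reward_fn: Callable[[float], float]) -> None:
        # Pathway 1 (Failure-Driven Reflective RL): optimize the
        # VLM-synthesized dense proxy reward on the failed rollout.
        _ = sum(reward_fn(o) for o in traj.observations)  # placeholder for a gradient step
        self.rl_updates += 1

    def sft_update(self, quality_threshold: float = 0.8) -> int:
        # Pathway 2 (Success-Driven Quality-Guided SFT): selectively imitate
        # only high-quality successes, grounding the policy in holistic
        # success and counteracting reward hacking.
        kept = [t for t in self.sft_buffer if t.quality >= quality_threshold]
        return len(kept)


def synthesize_reward(failure_traj: Trajectory) -> Callable[[float], float]:
    # Stand-in for the VLM's failure analysis: derive a dense shaping
    # signal from how far the failed attempt progressed (toy heuristic).
    target = max(failure_traj.observations)
    return lambda o: -abs(o - target)


def adaptation_step(policy: Policy, traj: Trajectory) -> None:
    # Route each rollout to the pathway that matches its outcome.
    if traj.success:
        policy.sft_buffer.append(traj)
    else:
        policy.rl_update(traj, synthesize_reward(traj))
```

Routing failures to reward synthesis and successes to a quality-filtered imitation buffer is the key design choice: the proxy reward accelerates exploration, while the SFT filter keeps the policy anchored to actual task success.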