Pre-trained Vision-Language-Action (VLA) models are a major step towards general-purpose robots, yet adapting them efficiently to new, specific tasks in situ remains difficult. Reinforcement learning (RL) is a promising avenue for such adaptation, but it is often inefficient, which hinders rapid task mastery. We introduce Reflective Self-Adaptation, a framework for rapid, autonomous task adaptation without human intervention. The framework establishes a self-improving loop in which the agent learns from its own experience to improve both strategy and execution. Its core is a dual-pathway architecture that addresses the full adaptation lifecycle. First, a Failure-Driven Reflective RL pathway enables rapid learning: it uses the causal reasoning of a vision-language model (VLM) to analyze failures and automatically synthesize a targeted, dense reward function. This provides a focused learning signal that significantly accelerates policy exploration. However, optimizing such a proxy reward carries the risk of "reward hacking," where the agent maximizes the reward function but fails the actual task. To counteract this, our second pathway, Success-Driven Quality-Guided SFT, grounds the policy in holistic task success. It identifies high-quality successful trajectories and selectively imitates them, keeping the agent aligned with the ultimate task goal. This pathway is strengthened by a conditional curriculum mechanism that aids initial exploration. We conduct experiments on challenging manipulation tasks. The results show that our framework converges faster and reaches higher final success rates than representative baselines. Our work presents a robust solution for creating self-improving agents that can efficiently and reliably adapt to new environments.
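As an illustration of how the second pathway could guard against reward hacking, the following minimal sketch filters rollouts by true task success before ranking them for imitation. The abstract does not specify the quality criterion or the data format, so the `Trajectory` fields, the `quality` score (shorter successful episodes rank higher), and the `top_k` parameter are all assumptions made for this example, not the authors' method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    """One rollout. All fields are illustrative, not the paper's data format."""
    name: str
    success: bool        # sparse task outcome reported by the environment
    proxy_return: float  # return under the synthesized dense (proxy) reward
    num_steps: int       # episode length, assumed to be positive


def quality(traj: Trajectory) -> float:
    """Hypothetical quality score: shorter episodes score higher."""
    return 1.0 / traj.num_steps


def select_for_sft(trajs: List[Trajectory], top_k: int) -> List[Trajectory]:
    """Keep only true successes, then return the top_k by quality.

    The proxy return is deliberately ignored here, so a rollout that
    scores well on the dense reward but fails the task is never imitated.
    """
    successes = [t for t in trajs if t.success]
    return sorted(successes, key=quality, reverse=True)[:top_k]


rollouts = [
    Trajectory("hacked", success=False, proxy_return=9.5, num_steps=40),
    Trajectory("slow", success=True, proxy_return=6.0, num_steps=120),
    Trajectory("clean", success=True, proxy_return=7.0, num_steps=60),
    Trajectory("ok", success=True, proxy_return=5.5, num_steps=80),
]

selected = select_for_sft(rollouts, top_k=2)
print([t.name for t in selected])  # prints ['clean', 'ok']
```

The "hacked" rollout has the highest proxy return but is excluded because it did not complete the task. This is the sense in which selecting on success rather than on the proxy reward keeps the imitation data tied to the actual goal.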