Vision-Language-Action (VLA) models have recently emerged as a powerful paradigm for general-purpose robot learning, enabling agents to map visual observations and natural-language instructions to executable robotic actions. Despite their popularity, these models are primarily trained via supervised fine-tuning (SFT) or training-time reinforcement learning (RL), which requires explicit fine-tuning phases, human intervention, or controlled data collection. Consequently, existing methods remain ill-suited to challenging simulated- or physical-world deployments, where robots must respond autonomously and flexibly to evolving environments. To address this limitation, we introduce Test-Time Reinforcement Learning for VLAs (TT-VLA), a framework that enables on-the-fly policy adaptation during inference. TT-VLA formulates a dense reward mechanism that leverages step-by-step task-progress signals to refine the action policy at test time while preserving the SFT/RL-trained priors, making it an effective complement to current VLA models. Empirical results show that our approach improves overall adaptability, stability, and task success in dynamic, previously unseen scenarios in both simulated and real-world settings. We believe TT-VLA offers a principled step toward self-improving, deployment-ready VLAs.
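To make the core mechanism concrete, the sketch below illustrates one plausible reading of test-time adaptation with a dense task-progress reward and a prior-preserving penalty. It is not the paper's implementation: the TinyPolicy network, task_progress estimator, KL coefficient, and placeholder transitions are all assumptions introduced purely for illustration.

```python
# Minimal sketch (assumed, not the authors' code): test-time policy refinement
# driven by a dense task-progress reward, regularized toward the frozen
# SFT/RL-trained prior so that pretrained behavior is preserved.
import copy
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 16, 6  # hypothetical observation/action sizes

class TinyPolicy(nn.Module):
    """Stand-in for a VLA action head: maps an observation to action logits."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 32), nn.ReLU(),
                                 nn.Linear(32, ACT_DIM))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def task_progress(obs):
    """Hypothetical dense progress estimate in [0, 1]; a real system would
    derive this from step-by-step task-progress signals."""
    return obs.mean(dim=-1).sigmoid()

policy = TinyPolicy()                     # adapted online at test time
prior = copy.deepcopy(policy).eval()      # frozen SFT/RL-trained prior
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
kl_coef = 0.1                             # strength of prior preservation

obs = torch.randn(OBS_DIM)
for step in range(50):                    # one test-time episode
    dist = policy(obs)
    action = dist.sample()
    next_obs = torch.randn(OBS_DIM)       # placeholder for the real transition
    # Dense reward: per-step increase in estimated task progress.
    reward = task_progress(next_obs) - task_progress(obs)
    # Policy-gradient step on the dense reward, pulled toward the prior via KL.
    with torch.no_grad():
        prior_dist = prior(obs)
    kl = torch.distributions.kl_divergence(dist, prior_dist)
    loss = -(reward.detach() * dist.log_prob(action)) + kl_coef * kl
    opt.zero_grad()
    loss.backward()
    opt.step()
    obs = next_obs
```

The KL term is one simple way to encode "refine the policy while preserving the trained prior"; the actual regularization and reward shaping used by TT-VLA may differ.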