Vision-language-action (VLA) policies provide strong priors for language-conditioned manipulation, but they remain brittle in off-nominal states that require targeted recovery. We propose ReCoVLA, a failure-conditioned residual recovery framework that keeps a pretrained VLA policy frozen, uses an external vision-language model (VLM) to infer the failure mode and recovery stage, and compiles a structured reward from task-relevant components. Rather than using the VLM to generate actions or rewards directly, ReCoVLA uses it as a semantic reward selector: the VLM predicts a recovery descriptor and a reward mask, which are used to train residual recovery policies in simulation, and the trained policies are then deployed zero-shot on the physical robot. Decoupling high-level failure understanding from low-level corrective control in this way lets the framework support different VLAs. Experiments across short-horizon, long-horizon, and contact-rich manipulation tasks show that ReCoVLA outperforms the tested baselines on average. In simulation, our reward compiler improves average success from 36.7% for the fine-tuned $\pi_{0.5}$ baseline to 66.7%. In physical zero-shot sim-to-real experiments, ReCoVLA achieves the best average performance, with 61.7% success.
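To make the "semantic reward selector" idea concrete, the following is a minimal sketch of how a reward mask might gate a set of task-relevant reward components, and how a residual correction might be added to a frozen base action. The abstract does not specify the reward form or the residual parameterization, so everything here is an assumption for illustration: the masked weighted sum, the clipped additive residual, and all names (`RecoveryDescriptor`, `compile_reward`, `compose_action`, the component labels, and the `scale` parameter) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class RecoveryDescriptor:
    """Hypothetical container for what the VLM is assumed to predict."""
    failure_mode: str              # e.g. "object_slipped" (illustrative label)
    stage: str                     # e.g. "regrasp" (illustrative label)
    reward_mask: Dict[str, float]  # component name -> weight; 0.0 disables it


def compile_reward(components: Dict[str, float],
                   descriptor: RecoveryDescriptor) -> float:
    """Assumed reward form: weighted sum over the components the mask selects."""
    return sum(weight * components[name]
               for name, weight in descriptor.reward_mask.items()
               if weight != 0.0)


def compose_action(base_action: Sequence[float],
                   residual: Sequence[float],
                   scale: float = 0.1) -> List[float]:
    """Assumed composition: frozen base action plus a clipped, scaled residual."""
    if len(base_action) != len(residual):
        raise ValueError("base action and residual must have the same length")
    return [b + scale * max(-1.0, min(1.0, r))
            for b, r in zip(base_action, residual)]


# Illustrative per-step reward components computed by the simulator.
step_components = {"reach": 0.5, "grasp": 1.0, "lift": 0.0, "align": 0.25}

# Illustrative descriptor: the mask ignores "reach" and emphasizes "align".
descriptor = RecoveryDescriptor(
    failure_mode="object_slipped",
    stage="regrasp",
    reward_mask={"reach": 0.0, "grasp": 1.0, "align": 2.0},
)

reward = compile_reward(step_components, descriptor)
action = compose_action([0.0, 0.5], [2.0, -0.5], scale=0.1)
```

In this sketch the VLM never outputs a reward value or an action; it only chooses which predefined components count and with what weight, which is one plausible reading of the selector role described above. The base policy's action is left untouched apart from the bounded additive correction.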