Automating Robot Failure Recovery Using Vision-Language Models With Optimized Prompts

Current robot autonomy struggles to operate beyond its assumed Operational Design Domain (ODD), the specific set of conditions and environments in which the system is designed to function, yet the real world is rife with uncertainties that can lead to failures. Automating recovery from these failures remains a significant challenge. Traditional methods either rely on human intervention to address failures manually or require exhaustively enumerating failure cases and designing a specific recovery policy for each scenario; both approaches are labor-intensive. Foundational Vision-Language Models (VLMs), which demonstrate remarkable common-sense generalization and reasoning capabilities, have broader, potentially unbounded ODDs. However, limited spatial reasoning remains a common weakness of many VLMs when they are applied to robot control and motion-level error recovery. In this paper, we investigate how optimizing visual and text prompts can enhance the spatial reasoning of VLMs, enabling them to function effectively as black-box controllers for both motion-level position correction and task-level recovery from unknown failures. Specifically, the optimizations include identifying key visual elements in visual prompts, highlighting these elements in text prompts for querying, and decomposing the reasoning process for failure detection and control generation. In experiments, VLMs with optimized prompts significantly outperform pre-trained Vision-Language-Action Models in correcting motion-level position errors and improve accuracy by 65.78% compared to VLMs with unoptimized prompts. Additionally, for task-level failures, optimized prompts improve VLMs' success rates in detecting failures, analyzing issues, and generating recovery plans by 5.8%, 5.8%, and 7.5%, respectively, across a wide range of unknown errors in Lego assembly.
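To make the three prompt optimizations concrete, the following is a minimal sketch of how a decomposed, element-aware prompt sequence might be assembled. It is illustrative only: the `VisualElement` structure, the `build_prompts` function, and the prompt wording are assumptions introduced here, not the prompts or code used in the paper, and the call to an actual VLM is deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VisualElement:
    """A key element marked in the image (hypothetical structure)."""
    label: str        # marker drawn on the image, e.g. "A"
    description: str  # what the marker denotes, e.g. "gripper tip"


def build_prompts(elements: List[VisualElement], task: str) -> List[str]:
    """Return a two-stage prompt sequence: failure detection, then control."""
    # Highlight the marked visual elements in the text prompt.
    legend = "\n".join(
        f"- Marker {e.label}: {e.description}" for e in elements
    )
    header = f"Task: {task}\nThe image highlights these elements:\n{legend}\n"

    # Stage 1: reason about spatial relations before judging failure.
    detect = header + (
        "Step 1: Describe where each marked element is relative to the others.\n"
        "Step 2: State whether the task has failed, and why."
    )
    # Stage 2: ask for a control output only after the analysis.
    control = header + (
        "Given the failure analysis above, output one corrective motion "
        "as a direction and an approximate distance."
    )
    return [detect, control]


if __name__ == "__main__":
    prompts = build_prompts(
        [VisualElement("A", "gripper tip"), VisualElement("B", "target stud")],
        task="place the brick on the target stud",
    )
    for p in prompts:
        print(p, end="\n\n")
```

Splitting the query into a detection prompt and a control prompt mirrors the decomposition of reasoning described above; in practice each string would be sent to the VLM together with the annotated image, with the first response supplied as context for the second.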
