Despite significant advances in robotic manipulation, achieving consistent and stable grasping remains a fundamental challenge, often limiting the successful execution of complex tasks. Our analysis reveals that even state-of-the-art policy models frequently exhibit unstable grasping behaviors, producing failures that bottleneck real-world robotic applications. To address these challenges, we introduce GraspCorrect, a plug-and-play module designed to enhance grasp performance through vision-language-model-guided feedback. GraspCorrect employs an iterative visual question-answering framework with two key components: grasp-guided prompting, which incorporates task-specific constraints, and object-aware sampling, which ensures that only physically feasible grasp candidates are selected. By iteratively generating intermediate visual goals and translating them into joint-level actions, GraspCorrect significantly improves grasp stability and consistently raises task success rates across existing policy models on the RLBench and CALVIN benchmarks.