Vision-language-action (VLA) models for embodied agents integrate perception, reasoning, and control, yet they remain constrained by two critical weaknesses: first, in grasping tasks, the action tokens generated by the language model often exhibit subtle spatial deviations from the target object, causing grasp failures; second, such models cannot reliably recognize task completion, which leads to redundant actions and frequent timeout errors. To address these challenges and improve robustness, we propose VLA-SCT, a lightweight, training-free framework. It operates as a self-correcting control loop that combines data-driven action refinement with conditional termination logic. Compared with baseline approaches, our method achieves consistent improvements across all task suites in the LIBERO benchmark, significantly increasing the success rate of fine manipulation tasks and ensuring accurate task completion, thereby supporting the deployment of more reliable VLA agents in complex, unstructured environments.
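The two components named above, data-driven action refinement and conditional termination, can be illustrated with a minimal sketch. All names here (`predict_action`, `detect_target`, `is_task_complete`, the blending weight `alpha`, and the step budget) are hypothetical placeholders for illustration, not the paper's actual API:

```python
def refine_action(action, target_xy, alpha=0.5):
    """Data-driven refinement (hypothetical): nudge the predicted grasp
    point toward the detected target position to correct small spatial
    deviations. alpha is an assumed blending weight."""
    return [a + alpha * (t - a) for a, t in zip(action, target_xy)]

def control_loop(predict_action, detect_target, is_task_complete, max_steps=50):
    """Self-correcting control loop sketch: refine each predicted action,
    and terminate via an explicit completion check instead of running
    until the step budget is exhausted."""
    trajectory = []
    for _ in range(max_steps):
        if is_task_complete():           # conditional termination logic:
            return trajectory, True      # stop; avoid redundant actions
        raw = predict_action()           # action from the VLA backbone
        refined = refine_action(raw, detect_target())
        trajectory.append(refined)
    return trajectory, False             # step budget exhausted (timeout)
```

The key design point mirrored from the abstract is that termination is decided by an external check rather than by the model's own action stream, which is what prevents the redundant-action and timeout failures described above.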