Vision-language-action (VLA) models trained on large-scale internet data and robot demonstrations have the potential to serve as generalist robot policies. However, despite their large-scale training, VLAs are often brittle to task-irrelevant visual details such as distractor objects or background colors. We introduce Bring Your Own VLA (BYOVLA): a run-time intervention scheme that (1) dynamically identifies regions of the input image that the model is sensitive to, and (2) minimally alters task-irrelevant regions to reduce the model's sensitivity using automated image editing tools. Our approach is compatible with any off-the-shelf VLA without model fine-tuning or access to the model's weights. Hardware experiments on language-instructed manipulation tasks demonstrate that BYOVLA enables state-of-the-art VLA models to nearly retain their nominal performance in the presence of distractor objects and backgrounds, which otherwise degrade task success rates by up to 40%. Website with additional information, videos, and code: https://aasherh.github.io/byovla/.
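The two-step intervention described above can be sketched in code. This is a minimal, hypothetical illustration, not the paper's implementation: the actual method uses a vision-language model to label task-relevant regions and a generative inpainting tool to edit the image, whereas here a grid-based occlusion probe stands in for the sensitivity estimate and a mean-color fill stands in for inpainting, so the loop is runnable end to end. The names `region_sensitivity` and `byovla_intervene` are invented for this sketch.

```python
import numpy as np


def region_sensitivity(policy, image, mask):
    """Sensitivity of the policy's output to perturbing one image region.

    Uses a simple occlusion probe (zeroing the region) as a stand-in for
    the paper's sensitivity estimation.
    """
    perturbed = image.copy()
    perturbed[mask] = 0.0
    return float(np.linalg.norm(policy(image) - policy(perturbed)))


def byovla_intervene(policy, image, task_relevant, grid=4, threshold=0.1):
    """Run-time intervention: edit task-irrelevant regions the policy is
    sensitive to, leaving task-relevant pixels untouched.

    task_relevant: boolean (H, W) mask of pixels deemed task-relevant
    (assumed given here; the paper obtains this from a VLM).
    """
    h, w = image.shape[:2]
    out = image.copy()
    for i in range(grid):
        for j in range(grid):
            # One cell of a coarse grid over the image.
            mask = np.zeros((h, w), dtype=bool)
            mask[i * h // grid:(i + 1) * h // grid,
                 j * w // grid:(j + 1) * w // grid] = True
            if task_relevant[mask].any():
                continue  # never alter task-relevant regions
            if region_sensitivity(policy, image, mask) > threshold:
                # Crude "inpaint": fill with the mean background color
                # (a real system would use an image editing model).
                out[mask] = image[~task_relevant].mean(axis=0)
    return out
```

Because the scheme only queries the policy on edited images, it treats the VLA as a black box, which is what makes it compatible with off-the-shelf models without fine-tuning or weight access.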