Vision-language-action (VLA) models trained on large-scale internet data and robot demonstrations have the potential to serve as generalist robot policies. However, despite their large-scale training, VLAs are often brittle to task-irrelevant visual details such as distractor objects or background colors. We introduce Bring Your Own VLA (BYOVLA): a run-time intervention scheme that (1) dynamically identifies regions of the input image that the model is sensitive to, and (2) minimally alters task-irrelevant regions to reduce the model's sensitivity using automated image editing tools. Our approach is compatible with any off-the-shelf VLA without model fine-tuning or access to the model's weights. Hardware experiments on language-instructed manipulation tasks demonstrate that BYOVLA enables state-of-the-art VLA models to nearly retain their nominal performance in the presence of distractor objects and backgrounds, which otherwise degrade task success rates by up to 40%. Website with additional information, videos, and code: https://aasherh.github.io/byovla/.
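The two-step intervention described above can be sketched in code. This is a minimal, hypothetical illustration, not the paper's implementation: the actual method uses a vision-language model to label task-relevant regions and a generative inpainting tool to edit the image, whereas here a grid-based occlusion probe stands in for the sensitivity estimate and a mean-color fill stands in for inpainting, so the loop is runnable end to end. The names `region_sensitivity` and `byovla_intervene` are invented for this sketch.

```python
import numpy as np


def region_sensitivity(policy, image, mask):
    """Sensitivity of the policy's output to perturbing one image region.

    Uses a simple occlusion probe (zeroing the region) as a stand-in for
    the paper's sensitivity estimation.
    """
    perturbed = image.copy()
    perturbed[mask] = 0.0
    return float(np.linalg.norm(policy(image) - policy(perturbed)))


def byovla_intervene(policy, image, task_relevant, grid=4, threshold=0.1):
    """Run-time intervention: edit task-irrelevant regions the policy is
    sensitive to, leaving task-relevant pixels untouched.

    task_relevant: boolean (H, W) mask of pixels deemed task-relevant
    (assumed given here; the paper obtains this from a VLM).
    """
    h, w = image.shape[:2]
    out = image.copy()
    for i in range(grid):
        for j in range(grid):
            # One cell of a coarse grid over the image.
            mask = np.zeros((h, w), dtype=bool)
            mask[i * h // grid:(i + 1) * h // grid,
                 j * w // grid:(j + 1) * w // grid] = True
            if task_relevant[mask].any():
                continue  # never alter task-relevant regions
            if region_sensitivity(policy, image, mask) > threshold:
                # Crude "inpaint": fill with the mean background color
                # (a real system would use an image editing model).
                out[mask] = image[~task_relevant].mean(axis=0)
    return out
```

Because the scheme only queries the policy on edited images, it treats the VLA as a black box, which is what makes it compatible with off-the-shelf models without fine-tuning or weight access.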