Vision-language-action models (VLAs) combine vision-language backbones with expressive generative action heads trained via flow matching on large-scale robotic datasets. Despite their strong empirical performance in robotic manipulation, VLAs lack mechanisms to quantify confidence in their predictions and to detect when their actions may be unreliable. This presents a critical limitation for real-world deployment in non-stationary environments, where models inevitably encounter scenarios outside their pretraining distribution and may fail without warning. To address this, we derive an efficient method for quantifying epistemic uncertainty in flow-matching models by leveraging velocity-field disagreement (VFD) across a small ensemble. We successfully use this uncertainty estimate for failure detection during deployment and active fine-tuning of flow-based VLAs. To this end, we propose SAVE, a framework for uncertainty-guided active multitask fine-tuning that reduces the number of costly expert demonstrations required to adapt VLAs to new tasks. Through extensive experiments on the LIBERO benchmark, we demonstrate that VFD yields better-calibrated uncertainty estimates predictive of downstream performance, that VFD achieves strong performance in detecting failures, and that uncertainty-guided data acquisition with SAVE requires at least 22% fewer samples than baselines. In summary, our work shows that quantifying epistemic uncertainty in flow-based VLAs improves both failure awareness and adaptation. Project website: tum-lsy.github.io/uq_vla/.
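As an illustration of the idea behind velocity-field disagreement, the following is a minimal sketch and not the authors' implementation. It assumes that disagreement is measured as the variance, across ensemble members, of the velocities they predict at shared probe points (x_t, t), averaged over action dimensions and probes. The function names, the probe sampling scheme, and the toy linear velocity fields are all hypothetical and chosen only to keep the example self-contained.

```python
import random
import statistics


def velocity_field_disagreement(velocity_fns, obs, action_dim, n_probes=16, seed=0):
    """Average across-member variance of predicted velocities (hypothetical VFD score).

    velocity_fns: list of callables v(x_t, t, obs) -> list of floats (one per action dim).
    obs: the conditioning input, here a single float for simplicity.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_probes):
        # Assumption: probe points are a uniform flow time and a Gaussian noise sample.
        t = rng.random()
        x_t = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
        # Every ensemble member is queried at the same (x_t, t, obs).
        preds = [fn(x_t, t, obs) for fn in velocity_fns]
        # Population variance across members, computed per action dimension.
        per_dim = [
            statistics.pvariance([p[d] for p in preds]) for d in range(action_dim)
        ]
        total += sum(per_dim) / action_dim
    return total / n_probes


def make_toy_member(weight):
    """Toy velocity field: members agree at obs == 0 and diverge as |obs| grows."""
    return lambda x_t, t, obs: [weight * obs - x for x in x_t]


if __name__ == "__main__":
    ensemble = [make_toy_member(w) for w in (0.9, 1.0, 1.1)]
    familiar = velocity_field_disagreement(ensemble, obs=0.0, action_dim=3)
    unfamiliar = velocity_field_disagreement(ensemble, obs=2.0, action_dim=3)
    print(f"disagreement at obs=0.0: {familiar:.4f}")
    print(f"disagreement at obs=2.0: {unfamiliar:.4f}")
```

In this toy setup the score is zero where the members agree and grows with the distance of the input from that region, which is the behavior one would want from an epistemic uncertainty signal. Under the same assumptions, such a score could be thresholded to flag likely failures or used to rank candidate tasks when deciding where to request demonstrations.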