Stable Language Guidance for Vision-Language-Action Models

Vision-Language-Action (VLA) models have demonstrated impressive capabilities in generalized robotic control; however, they remain notoriously brittle to linguistic perturbations. We identify a critical "modality collapse" phenomenon where strong visual priors overwhelm sparse linguistic signals, causing agents to overfit to specific instruction phrasings while ignoring the underlying semantic intent. To address this, we propose Residual Semantic Steering (RSS), a probabilistic framework that disentangles physical affordance from semantic execution. RSS introduces two theoretical innovations: (1) Monte Carlo Syntactic Integration, which approximates the true semantic posterior via dense, LLM-driven distributional expansion, and (2) Residual Affordance Steering, a dual-stream decoding mechanism that explicitly isolates the causal influence of language by subtracting the visual affordance prior. Theoretical analysis suggests that RSS effectively maximizes the mutual information between action and intent while suppressing visual distractors. Empirical results across diverse manipulation benchmarks demonstrate that RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
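
The abstract does not give the RSS equations, so the following is only a minimal sketch of how its two components might be read, under loudly stated assumptions: a discrete action space, Monte Carlo Syntactic Integration taken as averaging the action distribution over K paraphrases of the instruction, and Residual Affordance Steering taken as a guidance-style update that adds the residual between the language-conditioned and vision-only log-probabilities. The function names, the `alpha` weight, and the toy logits are all hypothetical and do not come from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def mc_average_logprobs(per_paraphrase_logits):
    """Assumed reading of Monte Carlo Syntactic Integration:
    log of the mean action distribution over K paraphrases."""
    k = len(per_paraphrase_logits)
    logps = [log_softmax(l) for l in per_paraphrase_logits]
    out = []
    for a in range(len(logps[0])):
        col = [lp[a] for lp in logps]
        m = max(col)
        # log-mean-exp across paraphrases for action a
        out.append(m + math.log(sum(math.exp(x - m) for x in col)) - math.log(k))
    return out


def residual_steer(cond_logp, prior_logits, alpha=1.0):
    """Assumed reading of Residual Affordance Steering:
    add alpha times the residual (language-conditioned minus vision-only)
    to the conditioned log-probabilities, then renormalize."""
    prior_logp = log_softmax(prior_logits)
    steered = [c + alpha * (c - p) for c, p in zip(cond_logp, prior_logp)]
    return log_softmax(steered)


def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])


# Toy example (made-up numbers): the visual prior strongly favors action 0,
# while the instruction paraphrases raise the score of action 1.
prior_logits = [2.0, 0.0, 0.0]          # vision-only stream
paraphrase_logits = [                   # vision + language stream, K = 3
    [2.0, 1.5, 0.0],
    [2.0, 1.7, 0.0],
    [2.0, 1.6, 0.0],
]

mc = mc_average_logprobs(paraphrase_logits)
steered = residual_steer(mc, prior_logits, alpha=1.0)
```

In this toy setup the paraphrase-averaged distribution still picks action 0, the visually dominant choice, while the steered distribution picks action 1, which illustrates the kind of effect that subtracting an affordance prior is meant to have. Setting `alpha=0.0` recovers the averaged distribution unchanged.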
