Vision-Language-Action (VLA) models have demonstrated impressive capabilities in generalist robotic control; however, they remain notoriously brittle to linguistic perturbations. We identify a critical ``modality collapse'' phenomenon in which strong visual priors overwhelm sparse linguistic signals, causing agents to overfit to specific instruction phrasings while ignoring the underlying semantic intent. To address this, we propose \textbf{Residual Semantic Steering (RSS)}, a probabilistic framework that disentangles physical affordance from semantic execution. RSS introduces two theoretical innovations: (1) \textbf{Monte Carlo Syntactic Integration}, which approximates the true semantic posterior via dense, LLM-driven distributional expansion, and (2) \textbf{Residual Affordance Steering}, a dual-stream decoding mechanism that explicitly isolates the causal influence of language by subtracting the vision-only affordance prior from the language-conditioned action distribution. Theoretical analysis suggests that RSS maximizes the mutual information between action and intent while suppressing visual distractors. Empirical results across diverse manipulation benchmarks demonstrate that RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
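To make the two mechanisms concrete, the sketch below shows one plausible instantiation in Python; it is a minimal illustration under stated assumptions, not the implementation used in this work. The callables \texttt{policy(obs, text)} (assumed to return action logits) and \texttt{paraphrase\_fn} (an assumed LLM-backed rephraser), together with the steering weight \texttt{beta} and sample count \texttt{n}, are hypothetical names introduced here purely for exposition.

\begin{verbatim}
import numpy as np

def softmax(logits):
    """Numerically stable softmax over action logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def residual_steering(policy, obs, instruction, beta=1.5):
    """Dual-stream decoding sketch (hypothetical, not the paper's code).

    Runs the policy twice: once conditioned on the instruction and once
    on an empty string, which stands in for the vision-only stream.
    """
    logits_cond = policy(obs, instruction)   # vision + language stream
    logits_prior = policy(obs, "")           # visual affordance prior only
    # The residual (logits_cond - logits_prior) isolates the causal
    # contribution of language; beta > 1 amplifies it over the prior.
    return logits_prior + beta * (logits_cond - logits_prior)

def mc_syntactic_posterior(policy, obs, instruction,
                           paraphrase_fn, n=8, beta=1.5):
    """Monte Carlo sketch of the semantic posterior (hypothetical).

    Averages steered action distributions over n instruction
    paraphrases, marginalizing out surface phrasing so that only the
    shared semantic intent drives the action distribution.
    """
    variants = [instruction] + [paraphrase_fn(instruction)
                                for _ in range(n - 1)]
    probs = [softmax(residual_steering(policy, obs, v, beta))
             for v in variants]
    return np.mean(probs, axis=0)
\end{verbatim}

In this reading, residual steering plays the role of a classifier-free-guidance-style correction on the action logits, while the Monte Carlo average over paraphrases approximates integration over syntactic variation; both correspondences are our illustrative framing rather than claims from the source.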