Vision-Language-Action (VLA) models have emerged as a dominant paradigm for generalist robotic manipulation, unifying perception and control within a single end-to-end architecture. However, despite their success in controlled environments, reliable real-world deployment is severely hindered by their fragility to visual disturbances. While existing literature extensively addresses physical occlusions caused by scene geometry, a critical failure mode remains largely unexplored: image corruptions. These sensor-level artifacts, ranging from electronic noise and dead pixels to lens contaminants, directly compromise the integrity of the visual signal before it is interpreted. In this work, we quantify this vulnerability, demonstrating that state-of-the-art VLAs such as $π_{0.5}$ and SmolVLA suffer catastrophic performance degradation under common signal artifacts, with success rates dropping from 90\% to as low as 2\%. To mitigate this, we introduce the Corruption Restoration Transformer (CRT), a plug-and-play, model-agnostic vision transformer designed to immunize VLA models against sensor disturbances. Leveraging an adversarial training objective, CRT restores clean observations from corrupted inputs without requiring computationally expensive fine-tuning of the underlying model. Extensive experiments across the LIBERO and Meta-World benchmarks demonstrate that CRT effectively recovers lost performance, enabling VLAs to maintain near-baseline success rates even under severe visual corruption.
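The paper's exact corruption suite is not specified here; as a minimal sketch, two of the named artifact classes (electronic noise and dead pixels) can be simulated on a normalized image array as follows. Function names and parameters (`sigma`, `frac`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def add_gaussian_noise(img, sigma=0.1, rng=None):
    # Simulate electronic sensor noise: additive Gaussian noise,
    # clipped back to the valid [0, 1] intensity range.
    rng = rng or np.random.default_rng(0)
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def add_dead_pixels(img, frac=0.01, rng=None):
    # Simulate dead pixels: a random fraction of pixel locations
    # is forced to zero across all channels.
    rng = rng or np.random.default_rng(0)
    out = img.copy()
    mask = rng.random(img.shape[:2]) < frac
    out[mask] = 0.0
    return out

# Example: corrupt a uniform gray 64x64 RGB observation.
img = np.full((64, 64, 3), 0.5)
noisy = add_gaussian_noise(img)
dead = add_dead_pixels(img)
```

A restoration module such as CRT would be trained to map such corrupted observations back to their clean counterparts before they reach the VLA's vision encoder.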