The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning degrades visual representations. To characterize and measure these effects, we probe the VLA's hidden representations and analyze its attention maps; we further design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating the changes in VL capabilities induced by action fine-tuning. We also evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations, and highlights practical approaches for recovering inherited VL capabilities. Code is publicly available: https://blind-vla-paper.github.io
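The abstract mentions probing hidden representations to quantify how much VL structure survives action fine-tuning. A common way to do this is a linear probe: fit a linear classifier on frozen hidden states and compare accuracy across checkpoints. The sketch below is illustrative only, with synthetic features standing in for real VLM/VLA activations; `probe_accuracy` and the feature generation are not from the paper.

```python
import numpy as np

def probe_accuracy(features, labels, seed=0):
    """Fit a closed-form ridge linear probe on half the data, score on the rest."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(features))
    split = len(features) // 2
    tr, te = idx[:split], idx[split:]
    n_classes = labels.max() + 1
    Y = np.eye(n_classes)[labels[tr]]          # one-hot targets
    X = features[tr]
    # Ridge-regularized least squares: W = (X^T X + lam*I)^{-1} X^T Y
    W = np.linalg.solve(X.T @ X + 1e-3 * np.eye(X.shape[1]), X.T @ Y)
    preds = (features[te] @ W).argmax(axis=1)
    return (preds == labels[te]).mean()

# Synthetic stand-ins for hidden states from a VLM vs. its action-fine-tuned VLA.
rng = np.random.default_rng(42)
labels = rng.integers(0, 4, size=400)
centers = rng.normal(size=(4, 32))
vlm_feats = centers[labels] + 0.3 * rng.normal(size=(400, 32))        # well-separated classes
vla_feats = 0.2 * centers[labels] + 1.0 * rng.normal(size=(400, 32))  # degraded separation

print("VLM probe acc:", probe_accuracy(vlm_feats, labels))
print("VLA probe acc:", probe_accuracy(vla_feats, labels))
```

A drop in probe accuracy between the two checkpoints is the kind of signal the paper uses as evidence of representation degradation; in practice the features would be layer activations extracted on a held-out VL benchmark rather than synthetic clusters.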