Vision-Language-Action (VLA) models have emerged as a promising approach to general-purpose robot manipulation. However, little research has examined mechanistically when and why they generalize across objects, scenes, and instructions. To probe their internal representations, we train Sparse Autoencoders (SAEs) on a VLA's hidden-layer activations. SAEs learn sparse dictionaries over model activations, often revealing features that correspond to interpretable directions in the model's representation space. We identify SAE features corresponding to motion primitives and semantic concepts, including features that generalize across episodes and are causally steerable. We propose a metric that categorizes features as either general, transferable primitives or episode-specific memorizations, offering a promising glimpse into how VLAs generalize. We validate these findings through steering experiments on both the LIBERO simulation benchmark and real-world DROID hardware. We find that amplifying general and semantic features induces behaviors consistent with their meanings, whereas ablating them destroys model performance. Furthermore, we demonstrate steering as a way to control behavior in unpromptable directions. Together, these results provide mechanistic evidence that VLAs can learn reusable internal features linking perception, language, and action across tasks and scenes. Our project page is at https://drvla.github.io.
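
To make the SAE and steering vocabulary concrete, the following is a minimal sketch rather than the implementation used in this work. It assumes a standard single-layer ReLU SAE with an L1 sparsity penalty, steering by adding a scaled decoder direction to a hidden activation, and ablation by subtracting a feature's decoded contribution. The dimensions, the random weights, and the names `encode`, `decode`, `sae_loss`, `steer`, and `ablate` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 32  # hypothetical sizes, not taken from the paper

# Randomly initialized SAE parameters (a trained SAE would learn these).
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_dec = np.zeros(d_model)


def encode(h):
    """Map activations h (batch, d_model) to non-negative feature activations."""
    return np.maximum(0.0, (h - b_dec) @ W_enc + b_enc)


def decode(f):
    """Reconstruct activations from feature activations f (batch, n_features)."""
    return f @ W_dec + b_dec


def sae_loss(h, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse features."""
    f = encode(h)
    recon_err = np.sum((h - decode(f)) ** 2, axis=-1)
    sparsity = np.sum(np.abs(f), axis=-1)
    return float(np.mean(recon_err + l1_coeff * sparsity))


def steer(h, feature_idx, alpha):
    """Amplify one feature by adding alpha times its decoder direction to h."""
    return h + alpha * W_dec[feature_idx]


def ablate(h, feature_idx):
    """Remove one feature by subtracting its decoded contribution from h."""
    f = encode(h)
    return h - f[..., feature_idx:feature_idx + 1] * W_dec[feature_idx]


# Stand-in for a batch of VLA hidden-layer activations.
h = rng.normal(size=(4, d_model))
h_steered = steer(h, feature_idx=3, alpha=2.0)
h_ablated = ablate(h, feature_idx=3)
```

In an actual experiment the steered or ablated activation would be written back into the model's forward pass at the layer the SAE was trained on; that hook is model-specific and is omitted here.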