Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models

Vision-Language-Action (VLA) models have emerged as a promising approach to general-purpose robot manipulation. However, their generalization is inconsistent: while these models can perform impressively in some settings, fine-tuned variants often fail on novel objects, scenes, and instructions. We apply mechanistic interpretability techniques to better understand the inner workings of VLA models. To probe their internal representations, we train Sparse Autoencoders (SAEs) on the VLA's hidden-layer activations. An SAE learns a sparse dictionary whose features act as a compact, interpretable basis for the model's computation. We find that the large majority of extracted SAE features correspond to memorized sequences from specific training demonstrations. However, some features correspond to interpretable, general, and steerable motion primitives and semantic properties, offering a promising glimpse of VLA generalizability. We propose a metric that categorizes features according to whether they represent generalizable, transferable primitives or episode-specific memorization. We validate these findings through steering experiments on the LIBERO benchmark, which show that individual SAE features causally influence robot behavior. Steering general features induces behaviors consistent with their semantic meaning, and the same features can be applied across tasks and scenes. This work provides the first mechanistic evidence that VLAs can learn features that generalize across tasks and scenes. We also observe that supervised fine-tuning on small robotics datasets disproportionately amplifies memorization, whereas training on larger, more diverse datasets (e.g., DROID) or using knowledge insulation promotes more general features. We provide an open-source codebase and a user-friendly interface for activation collection, SAE training, and feature steering. Our project page is located at http://drvla.github.io
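
To make the terms in the abstract concrete, the following is a minimal sketch of what an SAE forward pass and a single-feature steering step can look like. It is not the authors' implementation: the ReLU encoder, the additive steering rule, every function name, and the toy weights are assumptions chosen for illustration. The `episode_entropy` score is likewise a hypothetical stand-in for the idea of separating general from episode-specific features; the abstract does not define the actual metric.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sae_encode(h, W_enc, b_enc):
    """Feature activations f = ReLU(W_enc h + b_enc) (assumed encoder form)."""
    return [max(0.0, z + b) for z, b in zip(matvec(W_enc, h), b_enc)]

def sae_decode(f, W_dec, b_dec):
    """Reconstruction: b_dec plus the sum of f[i] * W_dec[i].

    Row i of W_dec is the dictionary direction of feature i.
    """
    return [
        b_dec[j] + sum(f[i] * W_dec[i][j] for i in range(len(f)))
        for j in range(len(b_dec))
    ]

def steer(h, W_dec, feature, alpha):
    """Add alpha times one feature's decoder direction to activation h
    (an assumed, commonly used steering rule)."""
    return [v + alpha * w for v, w in zip(h, W_dec[feature])]

def episode_entropy(acts_by_episode):
    """Hypothetical generality score, not the paper's metric.

    Normalized entropy of a feature's total activation per episode:
    0 when the feature fires in a single episode, 1 when its
    activation is spread evenly over all episodes.
    """
    total = sum(acts_by_episode)
    if total <= 0 or len(acts_by_episode) < 2:
        return 0.0
    probs = [a / total for a in acts_by_episode if a > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(acts_by_episode))

# Toy dictionary: 3 features over a 2-dimensional hidden state.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
B_DEC = [0.0, 0.0]

h = [0.5, -2.0]                                 # stand-in hidden activation
features = sae_encode(h, W_ENC, B_ENC)          # only feature 0 is active
reconstruction = sae_decode(features, W_DEC, B_DEC)
h_steered = steer(h, W_DEC, feature=1, alpha=3.0)
features_steered = sae_encode(h_steered, W_ENC, B_ENC)  # feature 1 now active
```

In this toy setup, steering along feature 1 changes which features are active when the steered activation is re-encoded. In a real VLA the steered activation would instead be written back into the forward pass at the hooked layer, and the effect would be judged from the robot's behavior.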
