Vision-Language-Action Models (VLAs) have shown remarkable progress toward embodied intelligence. While their architecture partially resembles that of Large Language Models (LLMs), VLAs exhibit higher complexity due to their multi-modal inputs and outputs and their often hybrid design combining transformer and diffusion heads. This is part of the reason why insights from mechanistic interpretability in LLMs, which explain how internal model representations relate to output behavior, do not trivially transfer to their VLA counterparts. In this work, we propose to close this gap by introducing and analyzing two main concepts: feature-observability and feature-controllability. In particular, we first study features that are linearly encoded in representation space and show how they can be observed by means of a linear classifier. Then, we use a minimal linear intervention grounded in optimal control to accurately place internal representations and steer the VLA's output toward a desired region. Our results show that targeted, lightweight interventions can reliably steer a robot's behavior while preserving closed-loop capabilities. Through simulation experiments on different VLA architectures ($\pi_{0.5}$ and OpenVLA), we demonstrate that VLAs possess interpretable internal structure amenable to online adaptation without fine-tuning, enabling real-time alignment with user preferences and task requirements.
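The two concepts above can be illustrated with a minimal numerical sketch. Assuming a feature direction $w$ has been obtained from a linear probe on the model's hidden states (as in the feature-observability step), a minimum-norm shift $\delta = \frac{(t - w^\top h)\,w}{\|w\|^2}$ moves a hidden state $h$ so that its readout along $w$ hits a target value $t$, without perturbing orthogonal directions. The function name and synthetic data below are illustrative, not from the paper:

```python
import numpy as np

def minimal_linear_intervention(h, w, target):
    """Minimum-norm delta such that w @ (h + delta) == target.

    Closed form: delta = (target - w @ h) * w / (w @ w),
    i.e. the shift lies entirely along the probe direction w.
    """
    residual = target - w @ h
    return residual * w / (w @ w)

# Synthetic stand-in for a hidden state and a probe direction.
rng = np.random.default_rng(0)
d = 16
w = rng.normal(size=d)
w /= np.linalg.norm(w)          # unit probe direction
h = rng.normal(size=d)          # hidden state to steer

delta = minimal_linear_intervention(h, w, target=2.0)
h_steered = h + delta

print(np.isclose(w @ h_steered, 2.0))   # readout now at the target
print(np.allclose(delta, (w @ delta) * w))  # shift is parallel to w
```

Because the shift is minimal in Euclidean norm and confined to the probe direction, it changes only the targeted feature's readout, which is the intuition behind interventions that steer behavior while leaving the rest of the representation, and hence closed-loop capabilities, intact.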