Efficient Long-Horizon Vision-Language-Action Models via Static-Dynamic Disentanglement

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for generalist robotic control. Built on vision-language model (VLM) architectures, VLAs predict actions conditioned on visual observations and language instructions, achieving strong performance and generalization across tasks. However, VLAs face two major challenges: limited long-horizon context, and inefficient inference caused by quadratic attention complexity and large parameter counts. Our work is motivated by the observation that much of the visual information in a trajectory (e.g., the background) remains static across timesteps. Leveraging this property, we propose SD-VLA, a framework that disentangles visual inputs into multi-level static and dynamic tokens. This disentanglement enables (1) retaining a single copy of static tokens across frames, which significantly reduces context length, and (2) reusing the key-value (KV) cache of static tokens through a lightweight recache gate that updates the cache only when necessary. Together, these two mechanisms make both multi-frame integration and inference efficient. In addition, we introduce a new benchmark that more effectively evaluates how well VLAs model long-horizon temporal dependencies. Experimental results show that our approach outperforms baselines on this benchmark with an absolute improvement of 39.8% in success rate, and achieves a 3.9% gain on the SimplerEnv benchmark. Moreover, SD-VLA delivers a 2.26x inference speedup over the base VLA model on the same benchmark, enabling faster and more practical real-world deployment.
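
To make the two mechanisms concrete, the following is a minimal Python sketch, not the SD-VLA implementation. It assumes each frame is already split into a fixed number of static and dynamic tokens, and it assumes a simple drift-threshold rule for the recache gate; the abstract does not specify how the gate decides. The names `context_length`, `StaticKVCache`, and `threshold`, as well as all token counts, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


def context_length(num_frames: int, static_tokens: int, dynamic_tokens: int,
                   share_static: bool) -> int:
    """Number of visual tokens in the context for `num_frames` frames."""
    if share_static:
        # One shared copy of the static tokens plus per-frame dynamic tokens.
        return static_tokens + num_frames * dynamic_tokens
    # Baseline: every frame contributes all of its tokens.
    return num_frames * (static_tokens + dynamic_tokens)


@dataclass
class StaticKVCache:
    """Toy cache that recomputes static-token KV only when the gate fires."""
    threshold: float  # assumed gate threshold, not taken from the paper
    reference: Optional[List[float]] = None
    recompute_count: int = 0

    def step(self, static_features: Sequence[float]) -> bool:
        """Return True if the cache was recomputed for this frame."""
        if self.reference is None or self._drift(static_features) > self.threshold:
            self.reference = list(static_features)
            self.recompute_count += 1  # stands in for recomputing keys and values
            return True
        return False

    def _drift(self, features: Sequence[float]) -> float:
        # Mean absolute difference from the features seen at the last recache.
        return sum(abs(a - b) for a, b in zip(features, self.reference)) / len(features)


if __name__ == "__main__":
    # Illustrative numbers only: 8 frames, 200 static and 56 dynamic tokens each.
    print(context_length(8, 200, 56, share_static=False))
    print(context_length(8, 200, 56, share_static=True))

    cache = StaticKVCache(threshold=0.1)
    frames = [[0.0, 0.0], [0.01, 0.0], [0.5, 0.5], [0.5, 0.52]]
    print([cache.step(f) for f in frames])
```

Under these assumed counts, sharing the static tokens shrinks the visual context from num_frames * (static + dynamic) tokens to static + num_frames * dynamic tokens, and the toy gate recomputes the cache only on the first frame and on the frame where the static features change substantially.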
