Efficient Long-Horizon Vision-Language-Action Models via Static-Dynamic Disentanglement

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for generalist robotic control. Built on vision-language model (VLM) architectures, VLAs predict actions conditioned on visual observations and language instructions, achieving strong performance and generalization across tasks. However, VLAs face two major challenges: limited long-horizon context, and inefficient inference caused by the quadratic complexity of attention and large parameter counts. Our work is motivated by the observation that much of the visual information in a trajectory (e.g., the background) remains static across timesteps. Leveraging this property, we propose SD-VLA, a framework that disentangles visual inputs into multi-level static and dynamic tokens. This disentanglement enables (1) retaining a single copy of the static tokens across frames, which significantly reduces context length, and (2) reusing the key-value (KV) cache of the static tokens through a lightweight recache gate that updates the cache only when necessary. Together, these make both multi-frame integration and inference efficient. In addition, we introduce a new benchmark that more effectively evaluates the ability of VLAs to model long-horizon temporal dependencies. Experimental results show that our approach outperforms baselines on this benchmark by an absolute 39.8% in success rate and achieves a 3.9% gain on the SimplerEnv benchmark. Moreover, SD-VLA delivers a 2.26x inference speedup over the base VLA model on the same benchmark, enabling faster and more practical real-world deployment.
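
To make the two mechanisms above concrete, the following is a minimal illustrative sketch, not the SD-VLA implementation. It assumes a hypothetical split of 200 static and 56 dynamic tokens per frame, and it stands in for the recache gate with a simple mean-absolute-drift threshold; the names `context_length`, `StaticCache`, and `threshold` are all invented for illustration, and the actual gate described in the abstract may work quite differently.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


def context_length(frames: int, static_tokens: int, dynamic_tokens: int,
                   share_static: bool) -> int:
    """Count visual tokens in the context for a window of `frames` frames."""
    if share_static:
        # One shared copy of the static tokens plus per-frame dynamic tokens.
        return static_tokens + frames * dynamic_tokens
    # Baseline: every frame contributes all of its tokens.
    return frames * (static_tokens + dynamic_tokens)


@dataclass
class StaticCache:
    """Tracks the static features that the cached keys/values were built from.

    The drift threshold is a hypothetical stand-in for a recache gate.
    """
    threshold: float
    reference: Optional[List[float]] = None
    recompute_count: int = 0

    def _drift(self, features: Sequence[float]) -> float:
        # Mean absolute difference between current and cached static features.
        assert self.reference is not None
        total = sum(abs(a - b) for a, b in zip(features, self.reference))
        return total / len(self.reference)

    def step(self, static_features: Sequence[float]) -> bool:
        """Return True if the cache had to be rebuilt for this frame."""
        if self.reference is None or self._drift(static_features) > self.threshold:
            # A real model would recompute the static tokens' keys/values here.
            self.reference = list(static_features)
            self.recompute_count += 1
            return True
        # Otherwise the previously cached keys/values are reused unchanged.
        return False


if __name__ == "__main__":
    # Hypothetical token counts: 8 frames, 200 static + 56 dynamic tokens each.
    print(context_length(8, 200, 56, share_static=False))  # 2048
    print(context_length(8, 200, 56, share_static=True))   # 648

    cache = StaticCache(threshold=0.1)
    frames = [[0.0, 0.0, 0.0, 0.0], [0.01, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
    print([cache.step(f) for f in frames])  # [True, False, True]
```

Under these assumed numbers, sharing the static tokens shrinks an 8-frame context from 2048 to 648 visual tokens, and the cache is rebuilt only on the first frame and on the frame where the static content changes substantially.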
