Vision-Language-Action (VLA) models have demonstrated strong multi-modal reasoning capabilities, enabling direct action generation from visual perception and language instructions in an end-to-end manner. However, their substantial computational cost poses a challenge for real-time robotic control, where rapid decision-making is essential. This paper introduces VLA-Cache, a training-free inference acceleration method that reduces computational overhead by adaptively caching and reusing static visual tokens across frames. Exploiting the temporal continuity of robotic manipulation, VLA-Cache identifies tokens that change minimally between adjacent frames and reuses their cached key-value representations, thereby avoiding redundant computation. To maintain action precision, VLA-Cache selectively recomputes task-relevant tokens that are sensitive to environmental changes, preserving the fidelity of critical visual information. To further improve efficiency, we introduce a layer-adaptive token reuse strategy that dynamically adjusts the reuse ratio based on attention concentration across decoder layers, prioritizing critical tokens for recomputation. Extensive experiments on two simulation platforms (LIBERO and SIMPLER) and a real-world robotic system demonstrate that VLA-Cache achieves up to a 1.7x speedup in CUDA latency and a 15% increase in control frequency, with negligible loss in task success rate. The code and videos can be found at our project page: https://vla-cache.github.io.
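The token-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, the thresholds, or how task relevance is scored, so the pixel-level cosine similarity, the `sim_threshold` and `top_k_relevant` parameters, and the function names `patchify` and `select_reusable_tokens` are all assumptions made for illustration. The layer-adaptive reuse ratio is not modeled here.

```python
import numpy as np


def patchify(frame, patch):
    """Split an (H, W, C) frame into flattened, non-overlapping patches."""
    h, w, c = frame.shape
    gh, gw = h // patch, w // patch
    x = frame[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    # Reorder to (grid_row, grid_col, patch_row, patch_col, channel).
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)


def select_reusable_tokens(prev_frame, curr_frame, attn_scores,
                           patch=14, sim_threshold=0.996, top_k_relevant=2):
    """Return indices of visual tokens whose cached KV entries could be reused.

    Hypothetical selection rule (an assumption, not taken from the paper):
    a token is reusable if its patch is nearly unchanged between frames
    and it is not among the top-k most attended (task-relevant) tokens.
    """
    prev = patchify(prev_frame.astype(np.float64), patch)
    curr = patchify(curr_frame.astype(np.float64), patch)

    # Cosine similarity between corresponding patches of adjacent frames.
    dot = (prev * curr).sum(axis=1)
    norms = np.linalg.norm(prev, axis=1) * np.linalg.norm(curr, axis=1)
    sim = dot / (norms + 1e-8)
    static = sim >= sim_threshold

    # Mark the most attended tokens as task-relevant; these are always
    # recomputed, even when their patches look static.
    relevant = np.zeros_like(static)
    if top_k_relevant > 0:
        relevant[np.argsort(attn_scores)[-top_k_relevant:]] = True

    return np.flatnonzero(static & ~relevant)
```

In a full pipeline, the returned indices would identify the positions whose cached key-value entries are kept from the previous step, while every other visual token is recomputed. How the attention scores are obtained and how the reuse ratio varies per decoder layer are left open in this sketch.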