World Action Models (WAMs) generalize better than standard Vision-Language-Action (VLA) policies to novel motions and environments, because a video-modeling objective lets them learn from abundant unlabeled video rather than scarce labeled robot demonstrations. This generalization is computationally expensive. To complete a task, a WAM runs over multiple inference chunks, and each chunk requires a costly denoising process. Existing acceleration methods reduce this cost by caching and reusing computation within a single chunk's denoising trajectory. Our empirical analysis reveals a substantial source of redundancy they overlook: redundancy across chunks. When a robot executes a smooth behavior, the residuals computed at a given denoising step are strongly correlated from one chunk to the next. We introduce C$^3$ache, a training-free method that caches and reuses these residuals across inference chunks at the same denoising step. Experiments on benchmarks with a Fast-WAM backbone show that C$^3$ache achieves up to a $2.5\times$ speedup in total wall-clock inference time, with negligible degradation in task success rate.
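The cross-chunk reuse described above can be illustrated with a minimal sketch. The abstract does not state when C$^3$ache decides a cached residual is safe to reuse, so everything below is an assumption made for illustration: the class name `CrossChunkResidualCache`, the relative-L1 similarity test, the `threshold` value, and the stand-in `toy_block` are all hypothetical and are not taken from the paper. The sketch only shows the general idea of keying a cache by denoising step and reusing the stored residual (block output minus block input) when the next chunk's input at that step is close to the previous one.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def rel_change(a: Vector, b: Vector) -> float:
    """Relative L1 distance of a from b (assumed similarity measure)."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(abs(y) for y in b) + 1e-8
    return num / den


class CrossChunkResidualCache:
    """Hypothetical wrapper: reuses residuals across chunks, per denoising step."""

    def __init__(self, block: Callable[[int, Vector], Vector], threshold: float = 0.05):
        self.block = block            # the expensive computation being skipped
        self.threshold = threshold    # assumed reuse criterion, not from the paper
        # step index -> (input seen at that step, residual computed at that step)
        self.store: Dict[int, Tuple[Vector, Vector]] = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, step: int, x: Vector) -> Vector:
        cached = self.store.get(step)
        if cached is not None:
            cached_x, cached_residual = cached
            if rel_change(x, cached_x) < self.threshold:
                # Input is close to the earlier chunk's: skip the block and
                # approximate its output as input + cached residual.
                self.hits += 1
                return [xi + ri for xi, ri in zip(x, cached_residual)]
        # Otherwise run the block and refresh the cache entry for this step.
        out = self.block(step, x)
        self.store[step] = (list(x), [o - xi for o, xi in zip(out, x)])
        self.misses += 1
        return out


def toy_block(step: int, x: Vector) -> Vector:
    """Stand-in for one costly denoising computation (purely illustrative)."""
    return [1.5 * xi + step for xi in x]


cached_block = CrossChunkResidualCache(toy_block, threshold=0.05)
num_steps = 4
chunks = [
    [1.0, 2.0, 3.0],     # first chunk: nothing cached yet
    [1.01, 2.01, 3.01],  # smooth continuation: residuals are reused
    [5.0, -2.0, 0.5],    # abrupt change: residuals are recomputed
]
for chunk in chunks:
    for step in range(num_steps):
        # Simplification: each step sees the chunk input directly, whereas a
        # real sampler would feed the evolving latent from step to step.
        cached_block(step, chunk)

print(cached_block.hits, cached_block.misses)
```

In this toy run only the smooth second chunk reuses cached residuals, which mirrors the observation that redundancy across chunks appears when the robot's behavior is smooth. A reused residual is an approximation, so a looser threshold trades accuracy for fewer block evaluations.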