Dual-system Vision-Language-Action (VLA) models achieve state-of-the-art performance in robotic manipulation but are bottlenecked by the VLM backbone, which must execute at every control step even though the features it produces are temporally redundant. We propose Latent Bridge, a lightweight model that predicts the change (delta) in VLM outputs between timesteps, so that the action head can operate on predicted outputs while the expensive VLM backbone is called only periodically. We instantiate Latent Bridge on two architecturally distinct VLAs, GR00T-N1.6 (feature-space bridge) and π0.5 (KV-cache bridge), demonstrating that the approach generalizes across VLA designs. Our task-agnostic DAgger training pipeline transfers across benchmarks without modification. Across four LIBERO suites, 24 RoboCasa kitchen tasks, and the ALOHA sim transfer-cube task, Latent Bridge retains 95-100% of baseline performance while reducing VLM calls by 50-75%, yielding a 1.65-1.73x net per-episode speedup.
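The inference pattern described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the actual implementation: the names `run_episode`, `toy_vlm`, `toy_bridge`, and `toy_action_head` are hypothetical, the fixed refresh period `refresh_every` is one simple way to call the VLM "only periodically", and the toy stand-ins are plain Python functions rather than neural networks.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def run_episode(
    observations: Sequence[float],
    vlm: Callable[[float], Vector],
    bridge: Callable[[Vector, float], Vector],
    action_head: Callable[[Vector], float],
    refresh_every: int,
) -> Tuple[List[float], int]:
    """Run one episode, calling the expensive VLM only every `refresh_every` steps.

    On the remaining steps, a cheap bridge predicts the delta of the VLM
    output and the action head consumes the updated (predicted) output.
    """
    features: Vector = []
    actions: List[float] = []
    vlm_calls = 0
    for t, obs in enumerate(observations):
        if t % refresh_every == 0:
            # Periodic refresh: recompute the true VLM output.
            features = vlm(obs)
            vlm_calls += 1
        else:
            # Bridged step: add the predicted delta to the previous output.
            delta = bridge(features, obs)
            features = [f + d for f, d in zip(features, delta)]
        actions.append(action_head(features))
    return actions, vlm_calls


# Hypothetical toy stand-ins (not the real models).
def toy_vlm(obs: float) -> Vector:
    return [obs, 2.0 * obs]


def toy_bridge(features: Vector, obs: float) -> Vector:
    # The toy observations below advance by 1 per step, so the true
    # per-step delta of toy_vlm is constant; a learned bridge would
    # instead predict it from its inputs.
    return [1.0, 2.0]


def toy_action_head(features: Vector) -> float:
    return sum(features)


observations = [float(t) for t in range(8)]
baseline_actions, baseline_calls = run_episode(
    observations, toy_vlm, toy_bridge, toy_action_head, refresh_every=1
)
bridged_actions, bridged_calls = run_episode(
    observations, toy_vlm, toy_bridge, toy_action_head, refresh_every=4
)
call_reduction = 1.0 - bridged_calls / baseline_calls
```

In this toy setting, a refresh period of 4 cuts VLM calls from 8 to 2 (a 75% reduction) and a period of 2 would cut them by 50%. Whether the reported 50-75% range corresponds to exactly these periods is an assumption. For a KV-cache bridge, the same loop would presumably update cached keys and values instead of a feature vector.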