Vision-Language-Action (VLA) models have become mainstream in embodied intelligence, but they incur high inference costs. Edge-Cloud Collaborative (ECC) inference is an effective remedy: it relieves the computational pressure on edge devices so that real-time requirements can be met. However, existing ECC frameworks are suboptimal for VLA models because of two challenges: (1) mainstream environment-oriented edge-cloud partitioning methods are susceptible to interference from visual noise; (2) existing edge-cloud partitioning methods overlook the step-wise redundancy unique to embodied tasks, thereby disrupting the physical continuity of motion. To address these issues, we propose a novel ECC inference framework, termed RAPID, and develop an implementation tailored to it. Experiments demonstrate that RAPID achieves a speedup of up to 1.73x with only 5% to 7% overhead.