Vision-Language-Action (VLA) models have become the mainstream approach in embodied intelligence, but they incur high inference costs. Edge-Cloud Collaborative (ECC) deployment offers an effective remedy: it relieves the computing pressure on the edge device so that real-time requirements can be met. However, existing ECC frameworks are suboptimal for VLA models because of two challenges: (1) diverse model structures make it difficult to identify the optimal ECC segmentation point; (2) even once the optimal segmentation point is determined, changes in network bandwidth can cause performance drift. To address these issues, we propose RoboECC, a novel ECC deployment framework for various VLA models. Specifically, we propose a model-hardware co-aware segmentation strategy that helps find the optimal segmentation point for various VLA models. Moreover, we propose a network-aware deployment adjustment approach that adapts to network fluctuations to maintain optimal performance. Experiments demonstrate that RoboECC achieves a speedup of up to 3.28x with only 2.55% to 2.62% overhead.
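As a rough illustration of why the best segmentation point depends on both the hardware and the network, the following Python sketch picks a split by exhaustive search. It minimizes a simple latency model: edge compute, plus activation transfer, plus cloud compute. This is not RoboECC's actual algorithm. The `LayerProfile` fields, the latency model, and all numbers are hypothetical assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LayerProfile:
    """Hypothetical per-layer profile; field names are assumptions, not from the paper."""
    edge_ms: float    # latency of this layer on the edge device
    cloud_ms: float   # latency of this layer on the cloud server
    out_bytes: int    # size of the activation this layer emits


def end_to_end_ms(layers: Sequence[LayerProfile], split: int,
                  input_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Latency when layers[:split] run on the edge and layers[split:] on the cloud.

    Assumes the returned action is small enough to ignore on the downlink.
    """
    edge = sum(layer.edge_ms for layer in layers[:split])
    cloud = sum(layer.cloud_ms for layer in layers[split:])
    if split == len(layers):
        transfer = 0.0  # everything runs on the edge, nothing is uploaded
    else:
        # Upload the raw input (split == 0) or the last edge-side activation.
        payload = input_bytes if split == 0 else layers[split - 1].out_bytes
        transfer = payload / bandwidth_bytes_per_s * 1000.0
    return edge + transfer + cloud


def best_split(layers: Sequence[LayerProfile], input_bytes: int,
               bandwidth_bytes_per_s: float) -> int:
    """Exhaustively pick the split index with the lowest modeled latency."""
    return min(range(len(layers) + 1),
               key=lambda s: end_to_end_ms(layers, s, input_bytes,
                                           bandwidth_bytes_per_s))


# Made-up three-layer profile: activations shrink as depth increases.
demo_layers = [
    LayerProfile(edge_ms=10, cloud_ms=1, out_bytes=1_000_000),
    LayerProfile(edge_ms=20, cloud_ms=2, out_bytes=10_000),
    LayerProfile(edge_ms=30, cloud_ms=3, out_bytes=100),
]

if __name__ == "__main__":
    # Re-evaluating the split as bandwidth changes mimics a network-aware adjustment.
    for bw in (1e9, 1e6, 1e4):
        print(bw, best_split(demo_layers, input_bytes=500_000,
                             bandwidth_bytes_per_s=bw))
```

Under this toy model the chosen split moves from fully cloud-side to fully edge-side as bandwidth drops, which is the kind of drift that a fixed split point cannot follow.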