Vision-Language-Action (VLA) models have become the mainstream approach in embodied intelligence, but they incur high inference costs. Edge-Cloud Collaborative (ECC) deployment offers an effective remedy by relieving the computational pressure on edge devices so that real-time requirements can be met. However, existing ECC frameworks are suboptimal for VLA models because of two challenges: (1) the diversity of model structures makes it hard to identify the optimal ECC segmentation point; (2) even once the optimal segmentation point is determined, changes in network bandwidth can cause performance drift. To address these issues, we propose RoboECC, a novel ECC deployment framework for a variety of VLA models. Specifically, we propose a model-hardware co-aware segmentation strategy that helps find the optimal segmentation point for different VLA models. We further propose a network-aware deployment adjustment approach that adapts to network fluctuations in order to maintain optimal performance. Experiments demonstrate that RoboECC achieves a speedup of up to 3.28x with only 2.55x~2.62x overhead.
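To make the two challenges concrete, the following is a minimal sketch of how a segmentation point might be chosen and then re-chosen when bandwidth changes. It is not RoboECC's actual algorithm, which the abstract does not specify. It assumes a simple additive latency model: layers before the split run on the edge, the activation at the split is sent over the network, and the remaining layers run in the cloud. The names `Layer`, `total_latency_ms`, and `best_split`, all per-layer numbers, and the choice to ignore the cost of returning the action to the edge are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Layer:
    """Hypothetical per-layer profile (all values are illustrative)."""
    edge_ms: float   # latency of this layer on the edge device
    cloud_ms: float  # latency of this layer on the cloud server
    out_bytes: int   # size of the activation this layer emits


def total_latency_ms(layers: Sequence[Layer], split: int,
                     input_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Latency when layers[:split] run on the edge and layers[split:] in the cloud.

    Assumption: the returned action is small, so the cloud-to-edge transfer
    is ignored.
    """
    edge = sum(layer.edge_ms for layer in layers[:split])
    cloud = sum(layer.cloud_ms for layer in layers[split:])
    if split == len(layers):
        transfer = 0.0  # everything runs on the edge, nothing is sent
    else:
        # Send the raw input if nothing ran on the edge, else the last activation.
        payload = input_bytes if split == 0 else layers[split - 1].out_bytes
        transfer = payload / bandwidth_bytes_per_s * 1000.0
    return edge + transfer + cloud


def best_split(layers: Sequence[Layer], input_bytes: int,
               bandwidth_bytes_per_s: float) -> int:
    """Exhaustively pick the split index with the lowest modeled latency."""
    candidates = range(len(layers) + 1)
    return min(candidates,
               key=lambda s: total_latency_ms(layers, s, input_bytes,
                                              bandwidth_bytes_per_s))
```

Calling `best_split` again with a freshly measured bandwidth is the simplest form of network-aware adjustment under this model: the preferred split moves toward the cloud when the link is fast and toward the edge when it is slow.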