Vision-Language-Action (VLA) models are promising for generalist robot control, but on-robot deployment is bottlenecked by the need for real-time inference under tight cost and energy budgets. Most prior evaluations rely on desktop-grade GPUs, which obscures the trade-offs and opportunities offered by heterogeneous edge accelerators (GPUs/XPUs/NPUs). We present a systematic analysis of low-cost VLA deployment through model-hardware co-characterization. First, we build a cross-accelerator leaderboard and evaluate model-hardware pairs under CET (Cost, Energy, Time) metrics, showing that right-sized edge devices can be more cost- and energy-efficient than flagship GPUs while still meeting control-rate constraints. Second, through in-depth profiling, we uncover a consistent two-phase inference pattern: a compute-bound VLM backbone followed by a memory-bound Action Expert, which induces phase-dependent underutilization and hardware inefficiency. Finally, guided by these insights, we propose DP-Cache and V-AEFusion to reduce diffusion redundancy and enable asynchronous pipeline parallelism, achieving up to 2.9x speedup on GPUs and 6x on edge NPUs with only marginal degradation in task success rate. An example leaderboard website is available at: https://vla-leaderboard-01.vercel.app/.
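The compute-bound versus memory-bound distinction can be illustrated with a simple roofline estimate. The sketch below is not the profiling method used in this work. It assumes a textbook roofline model, and every number in it (peak throughput, memory bandwidth, per-phase FLOPs and bytes moved) is a hypothetical placeholder chosen only to show how one accelerator can be saturated by one phase and left largely idle by the other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    flops: float        # floating-point operations per inference call
    bytes_moved: float  # bytes read from / written to device memory per call


def classify(phase: Phase, peak_flops: float, peak_bandwidth: float):
    """Classify a phase under a basic roofline model.

    Returns (bound, attainable FLOP/s, fraction of peak compute used).
    """
    intensity = phase.flops / phase.bytes_moved   # FLOP per byte
    ridge = peak_flops / peak_bandwidth           # intensity where the two roofs meet
    attainable = min(peak_flops, intensity * peak_bandwidth)
    bound = "compute-bound" if intensity >= ridge else "memory-bound"
    return bound, attainable, attainable / peak_flops


# Hypothetical accelerator: 100 TFLOP/s peak, 200 GB/s memory bandwidth.
PEAK_FLOPS = 100e12
PEAK_BANDWIDTH = 200e9

# Hypothetical workloads: a large batched backbone pass versus a small,
# repeatedly invoked action head that mostly streams weights from memory.
vlm_backbone = Phase("vlm_backbone", flops=4e12, bytes_moved=4e9)
action_expert = Phase("action_expert", flops=2e10, bytes_moved=1e9)

results = {
    p.name: classify(p, PEAK_FLOPS, PEAK_BANDWIDTH)
    for p in (vlm_backbone, action_expert)
}

for name, (bound, attainable, utilization) in results.items():
    print(f"{name}: {bound}, {attainable / 1e12:.1f} TFLOP/s attainable, "
          f"{utilization:.0%} of peak compute")
```

Under these assumed numbers the backbone sits above the ridge point and can use the full compute roof, while the action head is limited by bandwidth to a small fraction of peak compute. This is the kind of phase-dependent underutilization the abstract describes.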