Vision-Language-Action (VLA) models have recently demonstrated impressive capabilities across a range of embodied AI tasks. Deploying VLA models on real-world robots imposes strict real-time inference constraints, yet the inference performance landscape of VLA models remains poorly understood because of the large combinatorial space of model architectures and inference systems. In this paper, we ask a fundamental research question: how should we design future VLA models and systems to support real-time inference? To address this question, we first introduce VLA-Perf, an analytical performance model that can analyze inference performance for arbitrary combinations of VLA models and inference systems. Using VLA-Perf, we conduct the first systematic study of the VLA inference performance landscape. From a model-design perspective, we examine how inference performance is affected by model scaling, model architectural choices, long-context video inputs, asynchronous inference, and dual-system model pipelines. From a deployment perspective, we analyze where VLA inference should be executed (on-device, on edge servers, or in the cloud) and how hardware capability and network performance jointly determine end-to-end latency. By distilling 15 key takeaways from our comprehensive evaluation, we hope this work provides practical guidance for the design of future VLA models and inference systems.