Vision-Language-Action (VLA) policies are typically shipped as Python/PyTorch stacks that assume a workstation-class GPU, a mismatch for the hardware on which robots actually run. We present vla.cpp, a portable C++ inference runtime built on llama.cpp. To our knowledge, it is the first ggml-class engine to natively serve the flow-matching and diffusion VLA inference pattern, in which a cached vision-language prefix is consumed by a cross-attending action expert integrated over several solver steps. A single runtime serves seven architectures spanning five backbone and four action-head families behind one request/response protocol, with each model packaged as a self-contained bundle. On LIBERO-Object, the engine matches a state-of-the-art checkpoint to within one episode out of 200, and runs BitVLA at 100% success in 1.3 GiB of memory. The same bundle runs unchanged across three hardware tiers, from a consumer GPU down to an 8 GB embedded module. A cross-hardware roofline analysis shows that batch-1 VLA inference is compute-bound, so utilization rather than bandwidth is the deployment lever; an IMMA ladder GEMM derived from this analysis cuts BitVLA per-step latency by 4.5x. We then frame an on-robot stress test on an ALOHA arm that isolates the latency constraint under which a learned VLA must replan against a moving target on the hardware it was trained for. Code, demo videos, and the reproducible benchmark scaffold are available at https://fai-modelopt-tech.github.io/vla-cpp.github.io/.