Vision-Language-Action (VLA) models have emerged as powerful generalist policies for robotic control, yet how their performance scales across model architectures and hardware platforms, and what power budgets they require, remains poorly understood. This work evaluates five representative VLA models (state-of-the-art baselines plus two newly proposed architectures) on edge and datacenter GPU platforms. Using the LIBERO benchmark, we measure accuracy alongside system-level metrics, including latency, throughput, and peak memory usage, under varying edge power constraints and on high-performance datacenter GPU configurations. Our results identify three distinct scaling trends: (1) architectural choices, such as action tokenization and model backbone size, strongly influence throughput and memory footprint; (2) performance on power-constrained edge devices degrades non-linearly as the power budget shrinks, with some edge configurations matching or exceeding older datacenter GPUs; and (3) high-throughput variants can be achieved without significant accuracy loss. These findings offer actionable guidance for selecting and optimizing VLAs across a range of deployment constraints, and they challenge the current assumption that datacenter hardware is inherently superior for robotic inference.