Vision-Language-Action (VLA) models are an emerging class of workloads critical for robotics and embodied AI at the edge. As these models scale, they demonstrate significant capability gains, yet they must be deployed locally to meet the strict latency requirements of real-time applications. This paper characterizes VLA performance on two generations of edge hardware: the NVIDIA Jetson Orin and Jetson Thor platforms. Using MolmoAct-7B, a state-of-the-art VLA model, we identify a primary execution bottleneck: up to 75% of end-to-end latency is consumed by the memory-bound action-generation phase. Through analytical modeling and simulation, we project the hardware requirements for scaling to 100B-parameter models. We also explore high-bandwidth memory technologies and processing-in-memory (PIM) as promising future pathways for embodied-AI edge systems.