From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models

Vision-Language-Action (VLA) models have recently enabled embodied agents to perform increasingly complex tasks by jointly reasoning over visual, linguistic, and motor modalities. However, we find that the prevailing notion of "efficiency" in current VLA research, characterized by parameter count, FLOPs, or token decoding throughput, does not reflect actual performance on robotic platforms. In real-world execution, efficiency is determined by system-level embodied behaviors such as task completion time, trajectory smoothness, cumulative joint rotation, and motion energy. Through controlled studies across model compression, token sparsification, and action sequence compression, we make several observations that challenge common assumptions. (1) Methods that reduce computation under conventional metrics often increase end-to-end execution cost or degrade motion quality, even when they maintain task success rates. (2) System-level embodied efficiency metrics reveal performance differences in the learned action policies that remain hidden under conventional evaluations. (3) Common adaptation methods such as in-context prompting or supervised fine-tuning yield only mild, metric-specific improvements in embodied efficiency: they can reduce targeted metrics such as jerk or action rate, but the gains may come with trade-offs in other metrics, such as longer completion time. Taken together, our results suggest that conventional inference efficiency metrics can overlook important aspects of embodied execution. Incorporating embodied efficiency provides a more complete view of policy behavior and practical performance, enabling fairer and more comprehensive comparisons of VLA models.
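To make the embodied metrics named above concrete, the following is a minimal sketch of how they could be computed from a logged joint trajectory. It is an illustration under stated assumptions, not the paper's evaluation code: the function names `finite_diff` and `embodied_metrics`, the uniform sampling interval `dt`, the use of mean squared jerk as a smoothness measure, and the squared-velocity integral as a stand-in for motion energy are all hypothetical choices made here.

```python
def finite_diff(rows, dt):
    """First-order finite difference along time for a list of per-step joint vectors."""
    return [[(b - a) / dt for a, b in zip(r0, r1)] for r0, r1 in zip(rows, rows[1:])]


def embodied_metrics(q, dt):
    """Compute illustrative embodied-efficiency metrics.

    q  : list of T joint-angle vectors (radians), sampled uniformly (assumption)
    dt : sampling interval in seconds (assumption)
    """
    if len(q) < 4:
        raise ValueError("need at least 4 time steps to estimate jerk")

    vel = finite_diff(q, dt)      # joint velocity, T-1 steps
    acc = finite_diff(vel, dt)    # joint acceleration, T-2 steps
    jerk = finite_diff(acc, dt)   # joint jerk, T-3 steps

    n_jerk = sum(len(row) for row in jerk)
    return {
        # Duration of the executed trajectory.
        "completion_time_s": (len(q) - 1) * dt,
        # Total absolute joint travel, summed over all joints and steps.
        "cumulative_joint_rotation_rad": sum(abs(v) * dt for row in vel for v in row),
        # Smoothness proxy (assumed definition): lower means smoother motion.
        "mean_squared_jerk": sum(j * j for row in jerk for j in row) / n_jerk,
        # Energy proxy (assumed definition): integral of squared joint velocity.
        "energy_proxy": sum(v * v for row in vel for v in row) * dt,
    }


if __name__ == "__main__":
    # Two single-joint trajectories of equal duration and equal net displacement.
    straight = [[i / 10] for i in range(11)]
    zigzag = [[i / 10 + (0.1 if i % 2 else -0.1)] for i in range(11)]
    for name, traj in (("straight", straight), ("zigzag", zigzag)):
        print(name, embodied_metrics(traj, dt=0.1))
```

In this toy comparison both trajectories take the same time and achieve the same net displacement, yet the oscillating one accumulates more joint rotation and far higher jerk. A comparison based only on success and duration would not separate the two, which is the kind of hidden difference the abstract describes.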
