MLLMs have demonstrated remarkable comprehension and reasoning capabilities on complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks. However, developing MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms. At the same time, MLLM inference requires storing billions of parameters and performing tremendous computation, imposing significant hardware demands. In our paper, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM based on each situation at hand. The approach leverages a multi-exit architecture in MLLMs, which allows the model to terminate processing once an appropriately sized sub-model has been activated for the current situation, thus avoiding further redundant computation. Additionally, we develop novel algorithms that establish early-termination criteria for DeeR, conditioned on predefined demands such as average computational cost (i.e., power consumption), peak computational consumption (i.e., latency), and GPU memory usage. These enhancements ensure that DeeR operates efficiently under varying resource constraints while maintaining competitive performance. On the CALVIN robot manipulation benchmark, DeeR reduces the LLM's computational cost by 5.2-6.5x and its GPU memory usage by 2-6x without compromising performance. Code and checkpoints are available at https://github.com/yueyang130/DeeR-VLA.
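To make the multi-exit idea concrete, the following is a minimal illustrative sketch (not the paper's implementation): a stack of transformer blocks where each block is paired with an exit head, and inference terminates once a simple criterion is satisfied. All class names, dimensions, and the agreement-based termination criterion here are hypothetical simplifications; DeeR's actual criteria are conditioned on cost, latency, and memory budgets as described above.

```python
import torch
import torch.nn as nn


class MultiExitPolicy(nn.Module):
    """Toy multi-exit backbone: each block has its own action head,
    and inference stops early once an exit criterion is met."""

    def __init__(self, dim: int = 64, n_blocks: int = 4, action_dim: int = 7):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(n_blocks)
        )
        self.exits = nn.ModuleList(
            nn.Linear(dim, action_dim) for _ in range(n_blocks)
        )

    @torch.no_grad()
    def forward(self, x: torch.Tensor, threshold: float = 0.05):
        """x: (batch, seq_len, dim) multimodal token features.

        Returns the predicted action and the number of blocks executed.
        Termination criterion (a stand-in for DeeR's learned criteria):
        stop when two successive exits produce nearly identical actions,
        i.e. deeper computation is unlikely to change the decision.
        """
        prev_action = None
        action = None
        used = 0
        for block, exit_head in zip(self.blocks, self.exits):
            x = block(x)                      # run one more transformer block
            action = exit_head(x.mean(dim=1))  # pool tokens -> action prediction
            used += 1
            if prev_action is not None:
                if (action - prev_action).abs().max() < threshold:
                    break                     # exit early, skip remaining blocks
            prev_action = action
        return action, used


model = MultiExitPolicy()
obs = torch.randn(2, 5, 64)  # a dummy batch of fused vision-language tokens
action, blocks_used = model(obs)
```

In this sketch the per-step compute scales with `blocks_used`, so "easy" situations that stabilize at a shallow exit consume fewer FLOPs, which is the source of the savings the abstract reports.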