MLLMs have demonstrated remarkable comprehension and reasoning capabilities on complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks. However, developing MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms. At the same time, MLLM inference requires storing billions of parameters and performing tremendous computation, imposing significant hardware demands. In our paper, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM based on each situation at hand. The approach leverages a multi-exit architecture in MLLMs, which allows the model to terminate processing once an appropriately sized sub-model has been activated for the current situation, thus avoiding further redundant computation. Additionally, we develop novel algorithms that establish early-termination criteria for DeeR, conditioned on predefined demands such as average computational cost (i.e., power consumption), peak computational consumption (i.e., latency), and GPU memory usage. These enhancements ensure that DeeR operates efficiently under varying resource constraints while maintaining competitive performance. On the CALVIN robot manipulation benchmark, DeeR reduces the LLM's computational cost by 5.2-6.5x and its GPU memory usage by 2-6x without compromising performance. Code and checkpoints are available at https://github.com/yueyang130/DeeR-VLA.
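To make the multi-exit idea concrete, the following is a minimal illustrative sketch (not the paper's implementation): a stack of transformer blocks where each block is paired with an exit head, and inference terminates once a simple criterion is satisfied. All class names, dimensions, and the agreement-based termination criterion here are hypothetical simplifications; DeeR's actual criteria are conditioned on cost, latency, and memory budgets as described above.

```python
import torch
import torch.nn as nn


class MultiExitPolicy(nn.Module):
    """Toy multi-exit backbone: each block has its own action head,
    and inference stops early once an exit criterion is met."""

    def __init__(self, dim: int = 64, n_blocks: int = 4, action_dim: int = 7):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(n_blocks)
        )
        self.exits = nn.ModuleList(
            nn.Linear(dim, action_dim) for _ in range(n_blocks)
        )

    @torch.no_grad()
    def forward(self, x: torch.Tensor, threshold: float = 0.05):
        """x: (batch, seq_len, dim) multimodal token features.

        Returns the predicted action and the number of blocks executed.
        Termination criterion (a stand-in for DeeR's learned criteria):
        stop when two successive exits produce nearly identical actions,
        i.e. deeper computation is unlikely to change the decision.
        """
        prev_action = None
        action = None
        used = 0
        for block, exit_head in zip(self.blocks, self.exits):
            x = block(x)                      # run one more transformer block
            action = exit_head(x.mean(dim=1))  # pool tokens -> action prediction
            used += 1
            if prev_action is not None:
                if (action - prev_action).abs().max() < threshold:
                    break                     # exit early, skip remaining blocks
            prev_action = action
        return action, used


model = MultiExitPolicy()
obs = torch.randn(2, 5, 64)  # a dummy batch of fused vision-language tokens
action, blocks_used = model(obs)
```

In this sketch the per-step compute scales with `blocks_used`, so "easy" situations that stabilize at a shallow exit consume fewer FLOPs, which is the source of the savings the abstract reports.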