This paper addresses the inherent inference latency challenges associated with deploying multimodal large language models (MLLMs) in quadruped vision-language-action (QUAR-VLA) tasks. Our investigation reveals that conventional parameter-reduction techniques ultimately impair the performance of the language foundation model during the action instruction tuning phase, making them unsuitable for this setting. We introduce QUART-Online, a novel latency-free quadruped MLLM designed to enhance inference efficiency without degrading the performance of the language foundation model. By incorporating Action Chunk Discretization (ACD), we compress the original action representation space, mapping continuous action values onto a smaller set of discrete representative vectors while preserving critical information. We then fine-tune the MLLM to integrate vision, language, and compressed actions into a unified semantic space. Experimental results demonstrate that QUART-Online operates in tandem with the existing MLLM system, achieving real-time inference in sync with the underlying controller frequency and significantly boosting the success rate across various tasks by 65%. Our project page is https://quart-online.github.io.
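Below is a minimal, self-contained sketch of the Action Chunk Discretization idea described above: continuous action chunks are mapped to a small codebook of representative vectors, so each chunk can be handled as a single discrete symbol. The codebook here is fit with a naive k-means procedure as an assumption for illustration; the function names, chunk length, action dimension, and codebook size are hypothetical and not taken from the paper's actual implementation.

```python
import numpy as np

def fit_codebook(action_chunks: np.ndarray, num_codes: int, iters: int = 50) -> np.ndarray:
    """Fit a codebook of representative vectors with naive k-means.

    action_chunks: (N, chunk_len * action_dim) flattened continuous action chunks.
    Returns: (num_codes, chunk_len * action_dim) codebook.
    """
    rng = np.random.default_rng(0)
    # Initialize codes from randomly chosen chunks.
    codebook = action_chunks[rng.choice(len(action_chunks), num_codes, replace=False)]
    for _ in range(iters):
        # Assign each chunk to its nearest code (Euclidean distance).
        dists = np.linalg.norm(action_chunks[:, None, :] - codebook[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # Move each code to the mean of its assigned chunks.
        for k in range(num_codes):
            members = action_chunks[assign == k]
            if len(members) > 0:
                codebook[k] = members.mean(axis=0)
    return codebook

def discretize(chunk: np.ndarray, codebook: np.ndarray) -> int:
    """Map one continuous action chunk to the index of its nearest codebook vector."""
    return int(np.linalg.norm(codebook - chunk[None, :], axis=-1).argmin())

def reconstruct(token: int, codebook: np.ndarray) -> np.ndarray:
    """Recover the representative continuous action chunk from a discrete token."""
    return codebook[token]

if __name__ == "__main__":
    # Toy data: 512 chunks, each covering 8 timesteps of a 12-dim action space.
    chunks = np.random.randn(512, 8 * 12).astype(np.float32)
    codebook = fit_codebook(chunks, num_codes=128)
    token = discretize(chunks[0], codebook)
    approx = reconstruct(token, codebook)
    print("token:", token, "reconstruction error:", np.linalg.norm(chunks[0] - approx))
```

In this reading of ACD, the language model only has to emit one discrete index per action chunk rather than a long sequence of continuous values, which is what allows inference to keep pace with the controller frequency; the specific codebook-learning objective used by QUART-Online may differ from the k-means stand-in shown here.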