Generation Quality-Latency Tradeoff-Aware Inference Offloading for Multimodal LLMs in Cloud-Edge Continuum

Beyond pure cloud deployment, there are growing efforts to deploy Large Language Models (LLMs) at the edge to accelerate inference responses. Deploying LLMs across the cloud-edge continuum has therefore become a promising paradigm, in which tasks involving multimodal data account for a large share of requests. In this continuum, users typically care about multiple Quality-of-Service (QoS) attributes, but jointly optimizing them is often intractable. In this paper, we study the joint optimization of these attributes, focusing on two key representatives: content generation quality and response latency. Specifically, we study offloading as a means of trading off these two objectives in a cloud-edge collaborative Multimodal LLM (MLLM) system. Optimizing the offloading process is challenging, however, because both the generation quality and the inference latency of MLLM inference tasks are highly difficult to predict. To address these difficulties, we propose a Quality-Latency Tradeoff-Aware MLLM Inference Offloading (QLMIO) framework that makes decisions to optimally balance generation quality and response latency. In addition, recognizing the absence of publicly available datasets tailored to the MLLM inference offloading problem, we built a real-world cloud-edge collaborative MLLM system and used it to collect an MLLM Inference Offloading Benchmark (MIOBench), both to comprehensively evaluate our framework and to facilitate further study of this problem. Extensive experimental results demonstrate that the QLMIO framework reduces latency by up to 58.14% compared to baselines, while matching the task completion rate achieved when all requests are executed exclusively on a cloud server. The dataset and code are available on GitHub.
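
To make the quality-latency tradeoff concrete, the following is a minimal illustrative sketch, not the QLMIO method itself, which the abstract does not specify. It assumes that predicted quality and latency are already available for each execution target and combines them with a simple weighted score. The names `Estimate`, `score`, `choose_target`, the `weight` parameter, and the latency budget are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    """Hypothetical per-target predictions for a single request."""
    quality: float    # predicted generation quality, assumed to lie in [0, 1]
    latency_s: float  # predicted end-to-end response latency in seconds


def score(est: Estimate, weight: float, latency_budget_s: float) -> float:
    """Weighted tradeoff: reward quality, penalise normalised latency."""
    # Normalise latency by the budget and cap at 1 so both terms share a scale.
    norm_latency = min(est.latency_s / latency_budget_s, 1.0)
    return weight * est.quality - (1.0 - weight) * norm_latency


def choose_target(estimates: dict[str, Estimate],
                  weight: float = 0.5,
                  latency_budget_s: float = 10.0) -> str:
    """Return the name of the target with the highest tradeoff score."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    if latency_budget_s <= 0:
        raise ValueError("latency_budget_s must be positive")
    return max(estimates,
               key=lambda name: score(estimates[name], weight, latency_budget_s))


# Made-up numbers for illustration only: a faster, lower-quality edge model
# and a slower, higher-quality cloud model.
estimates = {
    "edge": Estimate(quality=0.70, latency_s=1.0),
    "cloud": Estimate(quality=0.90, latency_s=6.0),
}
balanced_choice = choose_target(estimates, weight=0.5)
quality_first_choice = choose_target(estimates, weight=0.9)
```

With these made-up estimates, an equal weighting selects the edge target, while a weighting that strongly favors quality selects the cloud target. The hard part identified in the abstract, predicting quality and latency before running the request, is assumed away here.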
