Multimodal large language models (MLLMs) enable powerful cross-modal reasoning but impose substantial computational and latency costs, which makes deployment on resource-constrained edge devices difficult. In this paper, we propose MSAO, an adaptive modality sparsity-aware offloading framework that uses edge-cloud collaboration for efficient MLLM inference. First, a lightweight, heterogeneous modality-aware module applies fine-grained sparsity analysis jointly across the spatial, temporal, and modal dimensions to compute the Modality Activation Sparsity (MAS) metric, which quantifies the necessity of each modality with minimal computational overhead. Second, an adaptive speculative edge-cloud collaborative offloading mechanism dynamically schedules workloads between edge and cloud based on the derived MAS scores and real-time system states, using confidence-guided speculative execution to hide communication latency. Extensive experiments on the VQAv2 and MMBench benchmarks demonstrate that MSAO reduces end-to-end latency by 30% and resource overhead by 30% to 65%, and improves throughput by 1.5x to 2.3x over traditional approaches, while maintaining competitive accuracy.
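To make the scheduling idea concrete, the following is a minimal illustrative sketch and not the MSAO implementation. The abstract does not define how MAS is computed or how the scheduler weighs system state, so everything here is an assumption: `activation_sparsity` uses the fraction of near-zero activations as a stand-in for a per-modality sparsity score, `SystemState` and its per-megabyte cost fields are hypothetical, and `plan` applies a simple latency comparison with a hypothetical `skip_above` threshold. Confidence-guided speculative execution is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List


def activation_sparsity(activations: List[float], eps: float = 1e-3) -> float:
    """Fraction of activations with magnitude below eps.

    Assumption: a simple proxy for a per-modality sparsity score;
    the paper's actual MAS definition is not given in the abstract.
    """
    if not activations:
        return 1.0
    near_zero = sum(1 for a in activations if abs(a) < eps)
    return near_zero / len(activations)


@dataclass
class SystemState:
    """Hypothetical snapshot of real-time system state."""
    bandwidth_mbps: float   # measured uplink bandwidth, megabits per second
    rtt_ms: float           # round-trip time to the cloud, milliseconds
    edge_ms_per_mb: float   # assumed edge processing cost per megabyte
    cloud_ms_per_mb: float  # assumed cloud processing cost per megabyte


def plan(
    sparsity: Dict[str, float],
    payload_mb: Dict[str, float],
    state: SystemState,
    skip_above: float = 0.9,
) -> Dict[str, str]:
    """Assign each modality to 'skip', 'edge', or 'cloud'.

    A modality whose sparsity is at or above skip_above is treated as
    unnecessary. Otherwise the estimated edge latency is compared with
    the estimated cloud latency (round trip + transfer + processing).
    """
    decisions: Dict[str, str] = {}
    for name, score in sparsity.items():
        if score >= skip_above:
            decisions[name] = "skip"
            continue
        size = payload_mb[name]
        edge_ms = size * state.edge_ms_per_mb
        # megabytes -> megabits, divided by Mbps gives seconds; convert to ms
        transfer_ms = size * 8.0 / state.bandwidth_mbps * 1000.0
        cloud_ms = state.rtt_ms + transfer_ms + size * state.cloud_ms_per_mb
        decisions[name] = "edge" if edge_ms <= cloud_ms else "cloud"
    return decisions


# Example usage with made-up numbers.
state = SystemState(bandwidth_mbps=100.0, rtt_ms=20.0,
                    edge_ms_per_mb=400.0, cloud_ms_per_mb=20.0)
decisions = plan(
    sparsity={"image": 0.2, "text": 0.1, "audio": 0.95},
    payload_mb={"image": 2.0, "text": 0.001, "audio": 1.0},
    state=state,
)
```

In this toy setting, large payloads tend to go to the cloud when bandwidth is good, small payloads stay on the edge because the round trip dominates, and highly sparse modalities are dropped before any scheduling decision is made.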