Compact LLM Deployment and World Model Assisted Offloading in Mobile Edge Computing

This paper investigates compact large language model (LLM) deployment and world-model-assisted inference offloading in mobile edge computing (MEC) networks. We first propose an edge compact LLM deployment (ECLD) framework that jointly applies structured pruning, low-bit quantization, and knowledge distillation to construct edge-deployable LLM variants, and we evaluate these models using four complementary metrics: accessibility, energy consumption, hallucination rate, and generalization accuracy. Building on the resulting compact models, we formulate an MEC offloading optimization problem that minimizes the long-term average inference latency subject to per-device energy budgets and LLM-specific quality-of-service constraints on effective accuracy and hallucination. To solve this problem under unknown and time-varying network dynamics, we develop a world model-proximal policy optimization (PPO) algorithm, which augments an on-policy PPO algorithm with a learned recurrent world model that provides improved value targets and short imagination rollouts. Extensive experiments on Llama-3.1-8B, Qwen3-8B, and Mistral-12B show that ECLD compresses base models by about 70-80% in storage (e.g., from 15.3 GB to 3.3 GB for Llama-3.1-8B) and reduces per-query energy consumption by up to 50%, while largely preserving accuracy and often lowering hallucination compared with quantization-only or pruning-only baselines. The experiments also show that world model-PPO speeds up convergence by about 50%, improves the final reward by 15.8% over vanilla PPO, and reduces average inference latency by 12-30% across different user populations, while satisfying the accuracy and hallucination constraints. The resulting policy approaches the generation quality of always offloading while retaining much of the efficiency of local execution.
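As a quick sanity check on the storage figures quoted above, the following minimal sketch computes the reduction implied by the Llama-3.1-8B example (15.3 GB to 3.3 GB). The function name `storage_reduction` is a hypothetical helper introduced here for illustration and is not part of the ECLD framework.

```python
def storage_reduction(base_gb: float, compact_gb: float) -> float:
    """Return the fractional storage reduction from base_gb to compact_gb."""
    if base_gb <= 0:
        raise ValueError("base_gb must be positive")
    return 1.0 - compact_gb / base_gb


# Sizes quoted in the abstract for Llama-3.1-8B.
base_gb, compact_gb = 15.3, 3.3

reduction = storage_reduction(base_gb, compact_gb)
ratio = base_gb / compact_gb

print(f"storage reduction: {reduction:.1%}")  # 78.4%
print(f"compression ratio: {ratio:.2f}x")     # 4.64x
```

The result, roughly 78%, falls inside the "about 70-80%" range stated for ECLD.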

