This paper investigates compact large language model (LLM) deployment and world-model-assisted inference offloading in mobile edge computing (MEC) networks. We first propose an edge compact LLM deployment (ECLD) framework that jointly applies structured pruning, low-bit quantization, and knowledge distillation to construct edge-deployable LLM variants, and we evaluate these models on four complementary metrics: accessibility, energy consumption, hallucination rate, and generalization accuracy. Building on the resulting compact models, we formulate an MEC offloading optimization problem that minimizes the long-term average inference latency subject to per-device energy budgets and LLM-specific quality-of-service constraints on effective accuracy and hallucination rate. To solve this problem under unknown and time-varying network dynamics, we develop a world-model-based proximal policy optimization algorithm (world model-PPO), which augments on-policy PPO with a learned recurrent world model that provides improved value targets and short imagination rollouts. Extensive experiments on Llama-3.1-8B, Qwen3-8B, and Mistral-12B show that ECLD reduces base-model storage by about 70-80% (e.g., from 15.3 GB to 3.3 GB for Llama-3.1-8B) and per-query energy consumption by up to 50%, while largely preserving accuracy and often lowering the hallucination rate relative to quantization-only or pruning-only baselines. The experiments also show that world model-PPO speeds up convergence by about 50%, improves the final reward by 15.8% over vanilla PPO, and reduces average inference latency by 12-30% across different user populations, while satisfying the accuracy and hallucination constraints. Its generation quality approaches that of always offloading while retaining much of the efficiency of local execution.
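As a quick sanity check on the storage figures quoted above, the following minimal sketch computes the fractional reduction implied by the Llama-3.1-8B example (15.3 GB to 3.3 GB). The helper name `storage_reduction` is hypothetical and introduced only for illustration; it is not part of the ECLD framework.

```python
def storage_reduction(original_gb: float, compressed_gb: float) -> float:
    """Return the fractional storage reduction, e.g. 0.75 for a 75% reduction.

    Hypothetical helper for illustration only; not part of ECLD.
    """
    if original_gb <= 0:
        raise ValueError("original_gb must be positive")
    if compressed_gb < 0:
        raise ValueError("compressed_gb must be non-negative")
    return 1.0 - compressed_gb / original_gb


# Figures quoted in the abstract for Llama-3.1-8B.
reduction = storage_reduction(15.3, 3.3)
print(f"Storage reduction: {reduction:.1%}")  # prints "Storage reduction: 78.4%"
```

The result, roughly 78.4%, falls inside the reported 70-80% range.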