Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, leading to their increasing adoption in diverse services delivered over wireless networks. There is a growing trend toward longer prompts to better leverage LLMs' capabilities and to tackle difficult tasks. However, longer prompts not only increase data transmission costs over wireless links but also demand more computing resources and processing time, degrading overall system efficiency and user experience. To address this challenge, we propose Joint Power and Prompt Optimization (JPPO), a framework that combines Small Language Model (SLM)-based prompt compression with wireless power allocation optimization. By deploying an SLM at edge devices for prompt compression and employing Deep Reinforcement Learning (DRL) to jointly optimize the compression ratio and transmission power, JPPO effectively balances service quality with resource efficiency. Furthermore, inspired by denoising diffusion models, we design a denoising-inspired prompt compression approach that compresses prompts iteratively by gradually removing non-critical information. Experimental results demonstrate that our framework achieves high service fidelity while optimizing power usage in wireless LLM services and reducing the total service response time. With our DRL-based JPPO, the framework maintains fidelity comparable to the no-compression baseline while achieving a 17% service time reduction through adaptive compression. When prioritizing compression, our framework achieves a compression ratio of up to 16x while maintaining acceptable fidelity (within a 30% reduction). Compared to no compression, baseline single-round compression at a 16x compression ratio reduces the total system response time by approximately 42.3%, while the denoising-inspired method achieves a 46.5% service time saving.
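The denoising-inspired compression described above can be sketched as an iterative loop that removes a shrinking fraction of low-importance tokens each round, analogous to the gradual schedule of a diffusion denoiser. The sketch below is a toy illustration only: the `compress_prompt` function and its word-length importance proxy are hypothetical stand-ins, since the paper scores and removes non-critical information with an SLM.

```python
# Toy sketch of denoising-inspired iterative prompt compression.
# Assumption: token importance is approximated by word length here;
# the actual framework uses an SLM at the edge device to judge importance.

def compress_prompt(prompt: str, target_ratio: float, rounds: int = 4) -> str:
    """Iteratively drop the least-important tokens until the prompt
    shrinks to roughly 1/target_ratio of its original token count."""
    tokens = prompt.split()
    keep_target = max(1, int(len(tokens) / target_ratio))
    for _ in range(rounds):
        if len(tokens) <= keep_target:
            break  # compression target reached early
        # Remove only part of the surplus per round, mirroring the
        # gradual "denoising" schedule rather than one-shot pruning.
        n_remove = max(1, (len(tokens) - keep_target) // 2)
        ranked = sorted(range(len(tokens)), key=lambda i: len(tokens[i]))
        drop = set(ranked[:n_remove])  # indices of lowest-importance tokens
        tokens = [t for i, t in enumerate(tokens) if i not in drop]
    return " ".join(tokens)

original = "please summarize the key findings of the attached quarterly report"
compressed = compress_prompt(original, target_ratio=2.0)
print(compressed)  # short words like "of" and "the" are pruned first
```

In the full JPPO framework, the compression ratio passed to such a routine would not be fixed but chosen by the DRL agent jointly with the transmission power, trading fidelity against wireless resource use.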