Edge Intelligence Optimization for Large Language Model Inference with Batching and Quantization

Generative Artificial Intelligence (GAI) is taking the world by storm with its content creation ability, and Large Language Models (LLMs) are at the forefront of this movement. However, the significant resource demands of LLMs often require cloud hosting, which raises concerns about privacy, latency, and usage limitations. Edge intelligence has long been used to address these challenges by enabling real-time AI computation on ubiquitous edge resources close to data sources, but most research has focused on traditional AI models, leaving a gap in addressing the unique characteristics of LLM inference: considerable model size, auto-regressive generation, and self-attention mechanisms. In this paper, we present an edge intelligence optimization problem tailored for LLM inference. Specifically, we formulate an inference model for transformer decoder-based LLMs that are deployed on resource-limited edge devices with batching and model quantization. Our approach aims to maximize inference throughput via batch scheduling and joint allocation of communication and computation resources, subject to edge resource constraints and varying user requirements for latency and accuracy. To solve this NP-hard problem, we develop an optimal Depth-First Tree-Searching algorithm with online tree-Pruning (DFTSP) that operates within a feasible time complexity. Simulation results indicate that DFTSP surpasses other batching benchmarks in throughput across diverse user settings and quantization techniques, and that it reduces time complexity by over 45% compared to the brute-force searching method.
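
To make the idea of depth-first tree search with pruning more concrete, the following is a minimal Python sketch on a deliberately simplified batching problem. It is not the DFTSP algorithm of this paper: the `Request` fields, the linear `batch_latency_ms` model, the single memory constraint, and the pruning rules are all assumptions introduced purely for illustration. The sketch selects the largest set of requests that fits a memory budget while the latency of the resulting batch stays within every selected request's deadline.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Request:
    memory_gb: int    # hypothetical memory footprint of the request
    deadline_ms: int  # hypothetical latency budget of the request


def batch_latency_ms(batch_size: int, base_ms: int = 500, per_request_ms: int = 100) -> int:
    """Toy latency model (an assumption): latency grows linearly with batch size."""
    return base_ms + per_request_ms * batch_size


def select_batch(requests: List[Request], memory_cap_gb: int) -> List[int]:
    """Return indices of a largest feasible batch, found by depth-first search.

    Each tree level decides whether to include one request. Two pruning rules
    are applied while the tree is explored:
      1. Feasibility: a branch is dropped as soon as adding a request breaks
         the memory cap or any selected deadline (larger batches only make
         both constraints tighter).
      2. Bound: a branch is dropped if, even when every remaining request is
         added, it cannot beat the best batch found so far.
    """
    best: List[int] = []

    def search(idx: int, chosen: List[int], used_gb: int, min_deadline_ms: int) -> None:
        nonlocal best
        if len(chosen) > len(best):
            best = chosen.copy()
        remaining = len(requests) - idx
        if remaining == 0 or len(chosen) + remaining <= len(best):
            return  # bound pruning (or leaf reached)

        req = requests[idx]
        new_used = used_gb + req.memory_gb
        new_min_deadline = min(min_deadline_ms, req.deadline_ms)
        latency = batch_latency_ms(len(chosen) + 1)
        # Branch 1: include the request if the batch stays feasible.
        if new_used <= memory_cap_gb and latency <= new_min_deadline:
            chosen.append(idx)
            search(idx + 1, chosen, new_used, new_min_deadline)
            chosen.pop()
        # Branch 2: skip the request.
        search(idx + 1, chosen, used_gb, min_deadline_ms)

    # The initial minimum deadline is a sentinel larger than any real deadline.
    sentinel = max((r.deadline_ms for r in requests), default=0) + 1
    search(0, [], 0, sentinel)
    return best


example = [
    Request(memory_gb=4, deadline_ms=1000),
    Request(memory_gb=3, deadline_ms=650),
    Request(memory_gb=2, deadline_ms=900),
    Request(memory_gb=5, deadline_ms=800),
]
print(select_batch(example, memory_cap_gb=9))  # -> [0, 2]
```

In this toy instance no three requests fit together, so the search settles on a batch of two. The bound check lets it discard most of the remaining branches without expanding them, which is the general reason pruning reduces the search effort relative to brute-force enumeration.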
