Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capabilities across a wide range of tasks. However, LLM inference incurs significant computational cost. In this paper, we propose an efficient LLM inference pipeline that leverages the capabilities of the LLM itself. Our approach first uses the LLM to perceive and predict the length of its response accurately and with minimal overhead. Using this prediction, we introduce an efficient sequence scheduling technique that groups queries with similar response lengths into micro-batches. We evaluate our approach on real-world instruction datasets using a LLaMA-based model, and our results show an 86% improvement in inference throughput without compromising effectiveness. Notably, our method is orthogonal to other inference acceleration techniques, making it a valuable addition to many existing toolkits for LLM inference (e.g., FlashAttention, quantization).
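The scheduling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the example lengths, and the cost model (each micro-batch keeps decoding until its longest response finishes, so shorter responses occupy wasted slots) are assumptions made for illustration, and the predicted lengths are supplied directly rather than obtained from an LLM.

```python
from typing import List, Sequence


def schedule_micro_batches(predicted_lengths: Sequence[int],
                           batch_size: int) -> List[List[int]]:
    """Group query indices so that queries with similar predicted
    response lengths land in the same micro-batch (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Sort query indices by predicted response length, shortest first.
    order = sorted(range(len(predicted_lengths)),
                   key=lambda i: predicted_lengths[i])
    # Cut the sorted order into consecutive chunks of batch_size.
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def decoding_cost(batches: Sequence[Sequence[int]],
                  lengths: Sequence[int]) -> int:
    """Token slots computed under the assumed cost model: every query in a
    batch is decoded for as many steps as the longest response in it."""
    return sum(len(b) * max(lengths[i] for i in b) for b in batches)


# Assumed example: predicted response lengths (in tokens) for 8 queries.
predicted = [10, 200, 12, 190, 15, 210, 8, 205]

# Baseline: batch queries in arrival order.
arrival_batches = [[0, 1, 2, 3], [4, 5, 6, 7]]
# Scheduled: batch queries with similar predicted lengths together.
scheduled_batches = schedule_micro_batches(predicted, batch_size=4)

baseline_cost = decoding_cost(arrival_batches, predicted)
scheduled_cost = decoding_cost(scheduled_batches, predicted)
```

In this toy setting, short and long responses no longer share a batch, so fewer decoding steps are spent on queries that have already finished. The real throughput gain depends on how accurate the length predictions are and on the serving stack.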