In this paper, we introduce a novel low-latency inference framework for large language models (LLMs) that enables inference with incomplete prompts. By reallocating computation to the prompt input phase, we achieve a substantial reduction in latency, thereby significantly enhancing the interactive experience for users of LLMs. The framework adeptly manages the visibility of the streaming prompt to the model, allowing it to infer from incomplete prompts or to await additional input. Compared with traditional inference methods that use complete prompts, our approach reduces response latency by an average of 59% on the MMLU-Pro dataset while maintaining comparable accuracy. Additionally, our framework facilitates collaborative inference and output across different models. By employing an LLM for inference and a small language model (SLM) for output, we achieve an average 68% reduction in response latency, alongside a 5.5% improvement in accuracy on the MMLU-Pro dataset, compared with the SLM baseline. For long prompts exceeding 20 sentences, response latency can be reduced by up to 93%.