Existing work on large language model (LLM) decomposition focuses mainly on improving downstream-task performance and overlooks the poor parallel inference performance that arises when the model size is scaled up. To mitigate this issue, this paper introduces DeInfer, a high-performance inference system dedicated to the parallel inference of decomposed LLMs. DeInfer combines multiple optimizations to maximize performance while remaining compatible with state-of-the-art optimization techniques. Extensive experiments evaluate DeInfer's performance, and the results demonstrate its superiority, suggesting that it can greatly facilitate the parallel inference of decomposed LLMs.