Existing work on large language model (LLM) decomposition focuses mainly on improving downstream-task performance, while overlooking the poor parallel-inference performance that arises as model size scales up. To address this bottleneck, this paper introduces DeInfer, a high-performance inference system dedicated to the parallel inference of decomposed LLMs. DeInfer combines multiple optimizations to maximize performance while remaining compatible with state-of-the-art optimization techniques. Extensive experiments evaluating DeInfer demonstrate its superiority, suggesting that it can greatly facilitate the parallel inference of decomposed LLMs.