The remarkable capabilities of Large Language Models (LLMs) are overshadowed by their immense computational cost. While recent work has shown that many LLM layers can be reordered or even removed with minimal impact on accuracy, these insights have not been translated into significant inference speedups. To bridge this gap, we introduce a novel method that restructures the computational graph by grouping and evaluating consecutive layer pairs in parallel. This approach, requiring no retraining, yields a 1.19x throughput gain on Llama 2 7B while reducing the average benchmark accuracy by only 1.5\%. We demonstrate the practical value of this method for large-scale LLM deployment and show that some of the lost accuracy can be recovered with lightweight fine-tuning of the parallelized layers.
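To make the core idea concrete, the following is a minimal, hypothetical sketch of evaluating a pair of consecutive residual blocks on the same input and summing their residual contributions, rather than running them sequentially. The class names (`ToyBlock`, `ParallelLayerPair`) and the merge rule are assumptions for illustration only; the paper's exact restructuring of Llama 2's decoder layers, and the kernel-level scheduling needed to realize the 1.19x throughput gain, may differ.

```python
import torch
import torch.nn as nn


class ToyBlock(nn.Module):
    """Stand-in for a transformer decoder layer: a residual MLP."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.mlp(h)


class ParallelLayerPair(nn.Module):
    """Evaluates two consecutive blocks on the same input and sums their
    residual contributions, instead of composing them sequentially."""

    def __init__(self, layer_a: nn.Module, layer_b: nn.Module):
        super().__init__()
        self.layer_a = layer_a
        self.layer_b = layer_b

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Sequential baseline would be: layer_b(layer_a(h)).
        # Here both layers read the same hidden state, so their work is
        # independent and can be dispatched concurrently (e.g., on
        # separate CUDA streams or fused into wider kernels).
        return h + (self.layer_a(h) - h) + (self.layer_b(h) - h)


if __name__ == "__main__":
    dim, batch, seq = 64, 2, 8
    blocks = [ToyBlock(dim) for _ in range(4)]
    # Fuse consecutive pairs: (0, 1) and (2, 3); no retraining is involved.
    fused = nn.Sequential(
        *[ParallelLayerPair(blocks[i], blocks[i + 1]) for i in range(0, len(blocks), 2)]
    )
    x = torch.randn(batch, seq, dim)
    print(fused(x).shape)  # torch.Size([2, 8, 64])
```

In this sketch the pair behaves like a two-branch residual block; any accuracy lost by ignoring the dependency between the two layers is what the lightweight fine-tuning of the parallelized layers is meant to recover.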