Due to their large number of parameters, the inference phase of Large Language Models (LLMs) is resource-intensive. However, not all requests posed to LLMs are equally difficult to handle. Through analysis, we show that for some tasks, LLMs can achieve results comparable to the final output at some intermediate layers; that is, not all layers of an LLM are necessary during inference. If we can predict the layer at which the intermediate result already matches the final result (produced by evaluating all layers), we can significantly reduce the inference cost. To this end, we propose a simple yet effective algorithm named AdaInfer that adaptively terminates the inference process for each input instance. AdaInfer relies on easily obtainable statistical features and classic classifiers such as SVM. Experiments on well-known LLMs, including the Llama2 series and OPT, show that AdaInfer achieves an average pruning ratio of 17.8%, and up to 43% on sentiment tasks, with a negligible performance drop (<1%). Because AdaInfer does not alter LLM parameters, models equipped with AdaInfer retain their generalizability across tasks.
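The core mechanism described above — a lightweight classifier deciding, from cheap per-layer statistics, whether to stop inference early — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the specific features (top probability and top-2 probability gap), the toy training data, and the helper names are all hypothetical assumptions.

```python
# Sketch of an AdaInfer-style stop/continue decision at each layer.
# Assumptions: features, toy data, and function names are illustrative only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def layer_features(logits):
    """Cheap per-layer statistics fed to the stop/continue classifier."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    top2 = np.sort(probs)[-2:]
    # [top probability, gap between top-1 and top-2] — a hypothetical choice
    return np.array([top2[-1], top2[-1] - top2[-2]])

# Toy training data: label 1 = this layer's prediction already matches the
# final layer's output (safe to stop), label 0 = keep evaluating layers.
X = rng.normal(loc=[[0.9, 0.85]] * 50 + [[0.3, 0.1]] * 50, scale=0.05)
y = np.array([1] * 50 + [0] * 50)
clf = SVC().fit(X, y)

def should_stop(logits):
    """Early-exit decision for the current layer's output logits."""
    return bool(clf.predict(layer_features(logits)[None, :])[0])
```

During inference, `should_stop` would be called after each transformer layer; a confident, peaked distribution triggers an early exit, while a flat one lets inference continue to deeper layers. Since only the classifier is trained, the LLM's own parameters are untouched, consistent with the generalizability claim.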