Recently, dynamic computation methods have shown notable acceleration for Large Language Models (LLMs) by skipping the computation of several layers, guided by elaborate heuristics or additional predictors. However, existing approaches assign different computational budgets to different samples during decoding, so they cannot guarantee a stable and precise speedup. Furthermore, they generally skip multiple contiguous layers at the bottom or top of the network, which causes a drastic change in the model's layer-wise representations and a consequent drop in performance. We therefore propose a Unified Layer Skipping strategy, which determines the number of layers to skip based solely on the target speedup ratio and then skips that number of intermediate layers in a balanced manner. Because the strategy is independent of the input sample, it naturally supports popular acceleration techniques such as batch decoding and KV caching, making it more practical for real-world applications. Experimental results on two common tasks, machine translation and text summarization, indicate that for a given target speedup ratio, Unified Layer Skipping significantly improves both inference performance and actual model throughput over existing dynamic approaches.
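As a rough illustration of the selection step described above, the following Python sketch picks which layers to execute from the target speedup ratio alone. It is not the authors' implementation. It rests on three assumptions that the abstract does not state: that latency scales linearly with the number of executed layers, that the first and last layers are always kept, and that "balanced" means evenly spaced indices. The names `select_kept_layers` and `forward_with_skipping` are hypothetical.

```python
from typing import Callable, List, Sequence


def select_kept_layers(num_layers: int, target_speedup: float) -> List[int]:
    """Return indices of layers to execute for a given target speedup.

    Assumption (not from the paper): cost is proportional to the number
    of executed layers, so roughly num_layers / target_speedup are kept.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if target_speedup < 1.0:
        raise ValueError("target_speedup must be at least 1.0")

    num_kept = max(1, min(num_layers, round(num_layers / target_speedup)))
    if num_kept == 1:
        return [0]

    # Spread the kept layers evenly between the first and last layer.
    # Integer round-half-up keeps the indices strictly increasing.
    span = num_layers - 1
    gaps = num_kept - 1
    return [(2 * i * span + gaps) // (2 * gaps) for i in range(num_kept)]


def forward_with_skipping(
    hidden: float,
    layers: Sequence[Callable[[float], float]],
    kept: Sequence[int],
) -> float:
    """Run only the kept layers; skipped layers pass the input through."""
    kept_set = set(kept)
    for idx, layer in enumerate(layers):
        if idx in kept_set:
            hidden = layer(hidden)
    return hidden


# Toy usage: 32 "layers" that each add 1, with a 2x target speedup.
kept = select_kept_layers(32, 2.0)
toy_layers = [lambda x: x + 1.0 for _ in range(32)]
print(len(kept), kept[0], kept[-1])  # 16 0 31
print(forward_with_skipping(0.0, toy_layers, kept))  # 16.0
```

Because the kept indices depend only on `num_layers` and `target_speedup`, every sample in a batch executes the same layers. This is the property the abstract credits for compatibility with batch decoding and KV caching.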