Autoregressive Transformers adopted in Large Language Models (LLMs) are hard to scale to long sequences. Despite several works trying to reduce their computational cost, most LLMs still compute attention between all pairs of tokens in the sequence, incurring a cost quadratic in sequence length. In this study, we present a novel approach that dynamically prunes contextual information while preserving the model's expressiveness, reducing memory and computational requirements during inference. Our method employs a learnable mechanism that determines, at any point during generation, which uninformative tokens can be dropped from the context. In doing so, our approach not only addresses performance concerns but also improves interpretability, providing valuable insight into the model's decision-making process. The technique can be applied to existing pre-trained models through a straightforward fine-tuning process, and the pruning strength can be controlled by a sparsity parameter. Notably, our empirical findings demonstrate that we can effectively prune up to 80\% of the context without significant performance degradation on downstream tasks, offering a valuable tool for mitigating inference costs. Our reference implementation achieves up to a $2\times$ increase in inference throughput and even greater memory savings.
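To make the idea concrete, the following is a minimal NumPy sketch of sparsity-controlled context pruning: given learned per-token keep scores, it drops the lowest-scoring fraction of cached key/value vectors. All names (`prune_context`, `scores`, `sparsity`) are illustrative assumptions, not the paper's implementation, which learns these scores end-to-end and applies them during generation.

```python
import numpy as np

def prune_context(keys, values, scores, sparsity=0.8):
    """Drop the lowest-scoring fraction of cached context tokens.

    keys, values: (seq_len, d) cached key/value vectors
    scores:       (seq_len,) learned per-token keep scores (higher = keep);
                  hypothetical stand-in for the paper's learnable mechanism
    sparsity:     fraction of the context to drop (the sparsity parameter)
    """
    seq_len = keys.shape[0]
    n_keep = max(1, int(round(seq_len * (1.0 - sparsity))))
    # Keep the n_keep highest-scoring tokens, preserving original order.
    keep = np.sort(np.argsort(scores)[-n_keep:])
    return keys[keep], values[keep], keep
```

With `sparsity=0.8`, attention at each decoding step would then run over only the surviving 20\% of the cached context, which is where the memory and throughput savings come from.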