Autoregressive Transformers, as used in Large Language Models (LLMs), are hard to scale to long sequences. Despite several efforts to reduce their computational cost, most LLMs still compute attention between all pairs of tokens in the sequence, which incurs a cost quadratic in sequence length. In this study, we present a novel approach that dynamically prunes contextual information while preserving the model's expressiveness, reducing memory and computational requirements during inference. Our method employs a learnable mechanism that determines which uninformative tokens can be dropped from the context at any point during generation. In doing so, our approach not only improves inference efficiency but also enhances interpretability, providing valuable insight into the model's decision-making process. Our technique can be applied to existing pre-trained models through a straightforward fine-tuning process, and the pruning strength can be specified by a sparsity parameter. Notably, our empirical findings demonstrate that we can prune up to 80\% of the context without significant performance degradation on downstream tasks, offering a valuable tool for mitigating inference costs. Our reference implementation achieves up to a $2\times$ increase in inference throughput and even greater memory savings.
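To make the idea of dropping tokens from the context during generation concrete, the following is a minimal sketch, not the reference implementation. The abstract does not specify how tokens are scored, so everything here is an assumption made for illustration: the `PrunedKVCache` class, the separate `prune_query` vector standing in for the learnable mechanism, the fixed `threshold` standing in loosely for the sparsity parameter, and the choice that a dropped token is never restored.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class PrunedKVCache:
    """Toy single-head key/value cache that drops low-scoring tokens.

    Hypothetical illustration only: in a trained model the pruning scores
    would come from learned projections, not hand-written vectors.
    """

    def __init__(self, threshold):
        # Assumed knob: a higher threshold prunes more aggressively.
        self.threshold = threshold
        self.entries = []  # list of (position, key, value)

    def step(self, position, query, prune_query, key, value):
        # Keep only earlier tokens whose pruning score passes the threshold.
        # Dropped tokens leave the cache, so they cannot come back later.
        survivors = [
            entry for entry in self.entries
            if dot(prune_query, entry[1]) >= self.threshold
        ]
        # The current token is always kept at its own step.
        survivors.append((position, key, value))
        self.entries = survivors

        # Standard scaled dot-product attention over the surviving tokens.
        scale = 1.0 / math.sqrt(len(query))
        weights = softmax([dot(query, k) * scale for _, k, _ in self.entries])
        return [
            sum(w * v[d] for w, (_, _, v) in zip(weights, self.entries))
            for d in range(len(value))
        ]

    def positions(self):
        return [p for p, _, _ in self.entries]


cache = PrunedKVCache(threshold=0.0)
steps = [
    # (query, prune_query, key, value) for each generated token
    ([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]),
    ([1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]),
    ([0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]),
    ([0.0, 1.0], [-1.0, 0.0], [0.0, 0.0], [0.5, 0.5]),
]
kept_history = []
outputs = []
for position, (query, prune_query, key, value) in enumerate(steps):
    outputs.append(cache.step(position, query, prune_query, key, value))
    kept_history.append(cache.positions())
```

After the loop, `kept_history` records which positions remain in the cache at each step. Because attention is computed only over the surviving entries, both the per-step compute and the cache memory scale with the number of kept tokens rather than with the full sequence length, and the kept positions themselves indicate which parts of the context this toy model relied on.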