Long-context inference presents challenges at the system level, with increased compute and memory requirements, as well as from an accuracy perspective, in reasoning effectively over long contexts. Recently, several methods have been proposed to compress the prompt and thereby reduce the context length. However, there has been little work comparing these methods across different tasks through a standardized analysis, which has led to conflicting results. To address this, we perform a comprehensive characterization and evaluation of different prompt compression methods. In particular, we analyze extractive compression, summarization-based abstractive compression, and token pruning methods. Surprisingly, we find that extractive compression often outperforms all the other approaches and enables up to 10x compression with minimal accuracy degradation. Interestingly, we also find that, despite several recent claims, token pruning methods often lag behind extractive compression, yielding only marginal improvements on summarization tasks.