We formalize the problem of prompt compression for large language models (LLMs) and present a framework that unifies token-level prompt compression methods that create hard prompts for black-box models. We derive the distortion-rate function for this setting as a linear program, and give an efficient algorithm for computing this fundamental limit via the dual of the linear program. Using the distortion-rate function as the baseline, we study the performance of existing compression schemes on a synthetic dataset consisting of prompts generated from a Markov chain, natural language queries, and their respective answers. Our empirical analysis demonstrates the importance of query-aware prompt compression, in which the compressor knows the downstream task/query for the black-box LLM. We show that there is a large gap between the performance of current prompt compression methods and the optimal strategy, and propose Adaptive QuerySelect, a query-aware, variable-rate adaptation of prior work that closes the gap. We extend our experiments to a small natural language dataset to further confirm the findings from our synthetic dataset.
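The abstract states that the distortion-rate function is computed as a linear program via its dual. The following is a minimal, generic sketch of that computational pattern only, not the paper's actual formulation: it solves a small toy LP with SciPy's HiGHS backend and reads off the dual (shadow-price) values of the constraints. The cost vector and constraint are made-up stand-ins for a distortion objective and a rate budget.

```python
# Generic LP sketch (hypothetical toy data, NOT the paper's distortion-rate LP):
#   minimize  c @ x   subject to  A_ub @ x <= b_ub,  x >= 0
import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 2.0])          # stand-in for per-choice distortion costs
A_ub = np.array([[-1.0, -1.0]])   # encodes x0 + x1 >= 1, a stand-in rate budget
b_ub = np.array([-1.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 2, method="highs")

print(res.fun)                 # optimal primal value: 1.0
print(res.x)                   # optimal point: [1.0, 0.0]
print(res.ineqlin.marginals)   # dual variable of the inequality constraint
```

By strong LP duality, the dual variables returned in `res.ineqlin.marginals` certify optimality of the primal solution; in a distortion-rate setting they play the role of the trade-off slope between rate and distortion.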