Prompting is the primary way to utilize the multitask capabilities of language models (LMs), but prompts occupy valuable space in the input context window, and repeatedly encoding the same prompt is computationally inefficient. Finetuning and distillation methods allow for specialization of LMs without prompting, but require retraining the model for each task. To avoid this trade-off entirely, we present gisting, which trains an LM to compress prompts into smaller sets of "gist" tokens which can be cached and reused for compute efficiency. Gist models can be trained with no additional cost over standard instruction finetuning by simply modifying Transformer attention masks to encourage prompt compression. On decoder (LLaMA-7B) and encoder-decoder (FLAN-T5-XXL) LMs, gisting enables up to 26x compression of prompts, resulting in up to 40% FLOPs reductions, 4.2% wall time speedups, and storage savings, all with minimal loss in output quality.
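The abstract says gist models are trained by modifying Transformer attention masks, but it does not spell out the mask itself. The following is a minimal sketch of one plausible decoder-only variant, assuming a sequence laid out as prompt tokens, then gist tokens, then the remaining input and output tokens. Under this assumption, positions after the gist tokens are blocked from attending to the raw prompt, so any information about the prompt has to flow through the gist positions. The function names `gist_causal_mask` and `render`, and the layout itself, are illustrative assumptions rather than the authors' implementation.

```python
from typing import List


def gist_causal_mask(n_prompt: int, n_gist: int, n_rest: int) -> List[List[bool]]:
    """Return mask[i][j] == True when query position i may attend to key position j.

    Assumed (hypothetical) layout: [prompt | gist | rest].
    """
    gist_start = n_prompt            # index of the first gist token
    gist_end = n_prompt + n_gist     # index one past the last gist token
    total = gist_end + n_rest

    mask = []
    for i in range(total):
        row = []
        for j in range(total):
            causal = j <= i  # standard decoder rule: no attending to the future
            # Assumed gist rule: positions after the gist tokens cannot attend
            # to the raw prompt, only to the gist tokens and later positions.
            blocked = i >= gist_end and j < gist_start
            row.append(causal and not blocked)
        mask.append(row)
    return mask


def render(mask: List[List[bool]]) -> str:
    """Draw the mask as text: 'x' means attention is allowed, '.' means masked."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    # 3 prompt tokens, 1 gist token, 2 remaining tokens.
    print(render(gist_causal_mask(3, 1, 2)))
```

In this sketch the prompt and gist rows keep the usual causal pattern, while the rows for the remaining tokens see only the gist column and later positions. That is what would make it possible, at inference time, to cache only the gist-token activations and discard the original prompt.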