Prompting is the primary way to use the multitask capabilities of language models (LMs), but prompts occupy valuable space in the input context window, and repeatedly encoding the same prompt is computationally inefficient. Finetuning and distillation methods can specialize LMs without prompting, but they require retraining the model for each task. To avoid this trade-off entirely, we present gisting, which trains an LM to compress prompts into smaller sets of "gist" tokens that can be cached and reused for compute efficiency. Gist models can be trained at no additional cost over standard instruction finetuning by simply modifying Transformer attention masks to encourage prompt compression. On decoder-only (LLaMA-7B) and encoder-decoder (FLAN-T5-XXL) LMs, gisting enables up to 26x compression of prompts, resulting in up to 40% FLOPs reductions, 4.2% wall time speedups, and storage savings, all with minimal loss in output quality.
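To make the attention-mask modification concrete, here is a minimal sketch of one way such a mask could be built for a decoder-only model. It assumes the sequence is laid out as prompt tokens, then gist tokens, then the remaining input and output tokens; the function name `gist_causal_mask` and its parameters are hypothetical and are not taken from the authors' code. The idea illustrated is that positions after the gist tokens are blocked from attending to the prompt, so any prompt information they need has to pass through the gist tokens.

```python
def gist_causal_mask(n_prompt, n_gist, n_rest):
    """Return a square boolean mask where mask[q][k] is True if query
    position q may attend to key position k.

    Assumed layout (illustrative only): [prompt | gist | rest].
    """
    total = n_prompt + n_gist + n_rest
    gist_end = n_prompt + n_gist  # index of the first position after the gist tokens
    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            causal = k <= q  # standard causal constraint: no attending to the future
            # Positions after the gist tokens may not look back at the prompt.
            blocked = q >= gist_end and k < n_prompt
            row.append(causal and not blocked)
        mask.append(row)
    return mask


# Example: 3 prompt tokens, 1 gist token, 2 remaining tokens.
mask = gist_causal_mask(n_prompt=3, n_gist=1, n_rest=2)
for row in mask:
    print(" ".join("1" if allowed else "0" for allowed in row))
# 1 0 0 0 0 0
# 1 1 0 0 0 0
# 1 1 1 0 0 0
# 1 1 1 1 0 0
# 0 0 0 1 1 0
# 0 0 0 1 1 1
```

In this toy layout the gist token (row 4 of the printout) still sees the whole prompt, while the last two rows see only the gist token and later positions. Under such a mask, only the gist-token activations would need to be cached to serve later requests that reuse the same prompt.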