Transformer-based NLP models are powerful but have high computational costs that limit deployment scenarios. Finetuned encoder-decoder models are popular in specialized domains and can outperform larger, more generalized decoder-only models such as GPT-4. We introduce a new configuration for encoder-decoder models that improves efficiency on structured-output and question-answering tasks where multiple outputs are required for a single input. Our method, prompt-in-decoder (PiD), encodes the input once and decodes the outputs in parallel, boosting both training and inference efficiency by avoiding duplicate input encoding, thereby reducing the decoder's memory footprint. We achieve a computation reduction that scales roughly with the number of subtasks, gaining up to a 4.6x speed-up over state-of-the-art models on dialogue state tracking, summarization, and question-answering tasks with comparable or better performance. We release our training/inference code and checkpoints.
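To make the encode-once, decode-in-parallel idea concrete, the following is a minimal sketch assuming a HuggingFace T5-style checkpoint; the model name, document, and subtask prompts are illustrative assumptions, and this is not the released PiD implementation.

```python
# Sketch of the prompt-in-decoder (PiD) idea: encode the shared input once,
# then decode several subtask prompts in parallel against that single encoding.
# Checkpoint, document, and prompts below are illustrative, not the paper's code.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers.modeling_outputs import BaseModelOutput

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

document = "guest: I need a cheap hotel in the centre from Friday for 3 nights."
subtask_prompts = ["hotel price range:", "hotel area:", "hotel stay length:"]
n = len(subtask_prompts)

# 1) Encode the shared input exactly once.
enc = tokenizer(document, return_tensors="pt")
encoder_hidden = model.get_encoder()(**enc).last_hidden_state  # (1, src_len, d)

# 2) Reuse that single encoding for every subtask by expanding the batch
#    dimension (a view, not a copy) instead of re-encoding once per subtask.
shared_enc = BaseModelOutput(last_hidden_state=encoder_hidden.expand(n, -1, -1))
shared_mask = enc["attention_mask"].expand(n, -1)

# 3) Place the subtask prompts in the decoder and run them as one batch.
dec = tokenizer(subtask_prompts, return_tensors="pt", padding=True,
                add_special_tokens=False)
start = torch.full((n, 1), model.config.decoder_start_token_id)
decoder_input_ids = torch.cat([start, dec["input_ids"]], dim=1)
decoder_attention_mask = torch.cat(
    [torch.ones(n, 1, dtype=torch.long), dec["attention_mask"]], dim=1)

out = model(
    encoder_outputs=shared_enc,
    attention_mask=shared_mask,
    decoder_input_ids=decoder_input_ids,
    decoder_attention_mask=decoder_attention_mask,
)
# out.logits has shape (n, prompt_len + 1, vocab); a greedy decoding loop would
# repeatedly append each subtask's argmax token to produce the n outputs.
print(out.logits.shape)
```

The expand is the key step: the source encoding (and hence its key-value cache in cross-attention) is materialized once and shared across the n decoder streams, which is where the per-subtask compute and memory savings come from.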