Large language models (LLMs) have demonstrated impressive in-context learning (ICL) capabilities, in which an LLM makes predictions for a given test input conditioned on a few input-output pairs (demonstrations). Nevertheless, including demonstrations lengthens the input, and the computational overhead of the self-attention mechanism grows quadratically with input length. Existing solutions attempt to distill lengthy demonstrations into compact vectors. However, they often require task-specific retraining or compromise the LLM's in-context learning performance. To mitigate these challenges, we present Meta dEmonstratioN Distillation (MEND), in which a language model learns to distill any lengthy demonstrations into vectors without retraining for a new downstream task. We exploit knowledge distillation to enhance the alignment between MEND and the LLM, achieving efficiency and effectiveness simultaneously. MEND is endowed with the meta-knowledge of distilling demonstrations through a two-stage training process consisting of meta-distillation pretraining and fine-tuning. Comprehensive evaluations across seven diverse ICL task partitions, using decoder-only (GPT-2) and encoder-decoder (T5) architectures, attest to MEND's effectiveness. MEND not only matches but often outperforms vanilla ICL as well as other state-of-the-art distillation models, while significantly reducing computational demands. This innovation promises enhanced scalability and efficiency for the practical deployment of large language models.