Large language models (LLMs) have demonstrated impressive in-context learning (ICL) capabilities, where an LLM makes predictions for a given test input together with a few input-output pairs (demonstrations). Nevertheless, including demonstrations lengthens the input, and the computational overhead of the self-attention mechanism grows quadratically with input length. Existing solutions attempt to distill lengthy demonstrations into compact vectors. However, they often require task-specific retraining or compromise the LLM's in-context learning performance. To mitigate these challenges, we present Meta dEmonstratioN Distillation (MEND), where a language model learns to distill any lengthy demonstrations into vectors without retraining for a new downstream task. We exploit knowledge distillation to enhance alignment between MEND and the LLM, achieving both efficiency and effectiveness. MEND is endowed with the meta-knowledge of distilling demonstrations through a two-stage training process, which includes meta-distillation pretraining and fine-tuning. Comprehensive evaluations across seven diverse ICL task partitions using decoder-only (GPT-2) and encoder-decoder (T5) architectures attest to MEND's effectiveness. It not only matches but often outperforms vanilla ICL as well as other state-of-the-art distillation models, while significantly reducing computational demands. This innovation promises enhanced scalability and efficiency for the practical deployment of large language models.
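To make the quadratic overhead concrete, the following minimal sketch counts the entries of a full self-attention score matrix for a prompt with raw demonstrations versus one where the demonstrations are replaced by a small number of distilled vectors. The function names and the token counts are hypothetical and chosen only for illustration; they are not taken from the paper, and the count ignores causal masking, which changes the constant but not the quadratic scaling.

```python
def attention_score_entries(num_tokens: int) -> int:
    """Entries in a full self-attention score matrix (one head, one layer)."""
    return num_tokens * num_tokens


def cost_ratio(demo_tokens: int, distilled_vectors: int, input_tokens: int) -> float:
    """Ratio of score-matrix sizes: raw demonstrations vs. distilled vectors.

    All three arguments are illustrative assumptions, not values from the paper.
    """
    full = attention_score_entries(demo_tokens + input_tokens)
    compact = attention_score_entries(distilled_vectors + input_tokens)
    return full / compact


if __name__ == "__main__":
    # Hypothetical sizes: 900 demonstration tokens distilled into 100 vectors,
    # followed by a 100-token test input.
    print(cost_ratio(demo_tokens=900, distilled_vectors=100, input_tokens=100))
```

Under these assumed sizes the prompt shrinks by a factor of 5 while the score matrix shrinks by a factor of 25, which is why shortening the demonstrations pays off more than linearly.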