In-context learning (ICL) enables large language models (LLMs) to exhibit remarkable emergent capabilities across diverse scenarios. Unfortunately, introducing demonstrations easily inflates the prompt length, placing a significant burden on hardware. In addition, randomly chosen demonstrations usually yield limited improvements in ICL, necessitating demonstration selection among the accessible candidates. Previous studies introduce extra modules to perform demonstration compression or selection independently. In this paper, we propose UniICL, an ICL framework that Unifies demonstration selection, demonstration compression, and final response generation via a single frozen LLM. Specifically, UniICL first projects the actual demonstrations and the inference text input into short virtual tokens, respectively. Then, the virtual tokens are used to select suitable demonstrations by measuring semantic similarity in latent space between the candidate demonstrations and the inference input. Finally, the inference text input, together with the selected virtual demonstrations, is fed into the same frozen LLM for response generation. Notably, UniICL is a parameter-efficient framework containing only 17M trainable parameters, all originating from the projection layer. We conduct experiments and analysis on in- and out-of-domain datasets covering both generative and understanding tasks, encompassing ICL scenarios with plentiful and limited demonstration candidates. Results show that UniICL effectively unifies $12\times$ compression, demonstration selection, and response generation, efficiently scaling the baseline from 4-shot to 64-shot ICL on IMDb within 24 GB of CUDA memory.
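The selection step described above (ranking candidate demonstrations by semantic similarity to the inference input in latent space) can be sketched as a top-$k$ cosine-similarity search over pooled virtual-token vectors. This is a minimal illustration, not the paper's implementation: the function name `topk_demonstrations` and the assumption that each demonstration is summarized by a single pooled vector are hypothetical choices for the sketch.

```python
import numpy as np

def topk_demonstrations(query_vec, cand_vecs, k):
    """Return indices of the k candidate demonstrations whose (pooled)
    virtual-token representations are most cosine-similar to the query's.

    query_vec : (d,) pooled latent vector of the inference input
    cand_vecs : (n, d) pooled latent vectors of n candidate demonstrations
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    sims = c @ q                  # cosine similarity of each candidate to the query
    return np.argsort(-sims)[:k]  # indices of the k most similar candidates

# Toy example: 4 candidate vectors, select the 2 closest to the query.
query = np.array([1.0, 0.0, 0.0])
cands = np.array([
    [0.9, 0.1, 0.0],    # similar to the query
    [0.0, 1.0, 0.0],    # orthogonal
    [1.0, 0.05, 0.0],   # most similar
    [-1.0, 0.0, 0.0],   # opposite direction
])
selected = topk_demonstrations(query, cands, k=2)  # -> candidates 2 and 0
```

In the full framework, the selected candidates' virtual tokens (not their original text) would then be prepended to the inference input, which is what keeps the prompt short at high shot counts.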