In-context learning (ICL) with large language models is challenging for tasks with many labels, because the limited context window makes it difficult to fit a sufficient number of examples in the prompt. In this paper, we use a pre-trained dense retrieval model to bypass this limitation, giving the model only a partial view of the full label space for each inference call. Testing with recent open-source LLMs (OPT, LLaMA), we set a new state of the art in few-shot settings on three common intent classification datasets, with no fine-tuning. We also surpass fine-tuned performance on fine-grained sentiment classification in certain cases. We analyze performance across the number of in-context examples and across model scales, showing that larger models are necessary to make effective and consistent use of larger context lengths for ICL. Through several ablations, we analyze the model's use of: a) the similarity of the in-context examples to the current input, b) the semantic content of the class names, and c) the correct correspondence between examples and labels. We demonstrate that all three are needed to varying degrees depending on the domain, contrary to certain recent works.
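The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `embed` function is a toy bag-of-words stand-in for a pre-trained dense retriever, and the example pool, the `Input:`/`Label:` prompt template, the ordering of examples, and all function names are assumptions made for illustration. The point it shows is that only the labels of the retrieved examples appear in the prompt, so the model sees a partial view of the label space on each call.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a dense encoder: a bag-of-words count vector (assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, pool, k):
    """Return the k (text, label) pairs from the pool most similar to the query."""
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(query, examples):
    """Format retrieved examples followed by the unlabeled query.

    Examples are written least similar first, so the closest one sits
    next to the query. This ordering is an illustrative choice.
    """
    blocks = [f"Input: {text}\nLabel: {label}\n" for text, label in reversed(examples)]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n".join(blocks)


# Hypothetical few-shot pool covering four intent labels.
POOL = [
    ("what is my account balance", "balance"),
    ("how much money is in my account", "balance"),
    ("i lost my credit card", "lost_card"),
    ("my card was stolen yesterday", "lost_card"),
    ("book a flight to boston", "book_flight"),
    ("set an alarm for seven", "set_alarm"),
]

query = "i think i lost my card"
examples = retrieve(query, POOL, k=2)
prompt = build_prompt(query, examples)
print(prompt)
```

In a real pipeline, `embed` would be replaced by a pre-trained sentence encoder and `prompt` would be passed to the LLM for completion. The value of `k` is then bounded by how many examples fit in the model's context window.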