In-context learning (ICL) with large language models (LLMs) has attracted increasing attention in the community, as it lets LLMs make predictions based only on instructions augmented with a few examples. Existing example selection methods for ICL use sparse or dense retrievers and achieve strong performance. However, these methods do not exploit direct feedback from the LLM to train the retriever, so the selected examples do not necessarily improve the LLM's analogical ability. To address this, we propose a policy-based reinforcement learning framework for example selection (RLS), which consists of a language model (LM) selector and an LLM generator. The LM selector encodes candidate examples into dense representations and selects the top-k examples as the demonstration for the LLM. The LLM's outputs are used to compute the reward and the policy gradient that optimize the LM selector. Experiments on diverse datasets show that our method significantly outperforms existing example selection methods. Moreover, our approach outperforms supervised fine-tuning (SFT) models in the few-shot setting. Further experiments show that the balance between the richness of the examples and their similarity to the test case is important for the ICL performance of LLMs.
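The selection loop described above can be sketched with a REINFORCE-style update. This is a minimal illustration, not the paper's implementation: the candidate encodings, the linear scoring policy, the stubbed reward function, and all hyperparameters are assumptions standing in for the LM selector and the LLM generator's feedback.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each candidate example is a dense vector (standing in
# for the LM selector's encodings); a linear softmax policy scores candidates
# and samples k of them as the in-context demonstration.
n_candidates, dim, k = 8, 4, 2
candidates = rng.normal(size=(n_candidates, dim))
theta = np.zeros(dim)  # selector parameters


def policy_probs(theta):
    """Softmax distribution over candidate examples."""
    logits = candidates @ theta
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def reward(selected):
    # Stand-in for the LLM generator's feedback (e.g. quality of its output
    # given the selected demonstration). Here we simply reward examples whose
    # first feature is large, purely for illustration.
    return candidates[selected, 0].mean()


# REINFORCE: sample a demonstration, score it, and move theta along the
# log-probability gradient weighted by the advantage (reward - baseline).
lr, baseline = 0.3, 0.0
for step in range(300):
    p = policy_probs(theta)
    selected = rng.choice(n_candidates, size=k, replace=False, p=p)
    r = reward(selected)
    baseline = 0.9 * baseline + 0.1 * r  # moving-average baseline
    # grad log pi(i) for a softmax policy is x_i - E_p[x]; summing over the
    # k sampled examples is an approximation for without-replacement sampling.
    grad = sum(candidates[i] - p @ candidates for i in selected)
    theta += lr * (r - baseline) * grad

# After training, the selector should favor high-reward candidates.
final_p = policy_probs(theta)
best = int(np.argmax(candidates[:, 0]))
```

In the full framework, `reward` would be replaced by a score derived from the LLM's generated output, so the selector is optimized end-to-end through the LLM's feedback rather than through retrieval similarity alone.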