Large language models (LLMs) have demonstrated remarkable capabilities across various research domains, including Information Retrieval (IR). However, the responses generated by off-the-shelf LLMs tend to be generic, i.e., they fail to capture the distinctiveness of documents with similar content. This limits the performance of LLMs in IR, because finding and distinguishing relevant documents from a large pool of similar documents is a typical problem in many IR tasks. To address this issue, we propose an unsupervised alignment method, Reinforcement Learning from Contrastive Feedback (RLCF), which empowers LLMs to generate responses that are both high-quality and context-specific. Our approach constructs unsupervised contrastive feedback signals from groups of similar documents and adopts a reward function, named group-wise reciprocal rank, to optimize LLMs within a standard Proximal Policy Optimization (PPO) framework. We conduct extensive experiments evaluating the effectiveness of RLCF on LLMs of different languages and parameter sizes across multiple downstream IR applications. RLCF significantly outperforms existing alignment methods, and RLCF-optimized LLMs show considerable improvement in generating distinctive responses.
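To make the reward concrete, below is a minimal sketch of one plausible reading of the group-wise reciprocal rank: a response generated from a target document is rewarded by the reciprocal rank of that document when every document in its similar-document group is scored against the response. The scorer `score_fn`, the tie-handling, and this exact formulation are illustrative assumptions, not the paper's definitive implementation.

```python
from typing import Callable, Sequence


def group_wise_reciprocal_rank(
    response: str,
    group_docs: Sequence[str],
    target_idx: int,
    score_fn: Callable[[str, str], float],
) -> float:
    """Reward for a response generated from group_docs[target_idx]: the
    reciprocal rank of the target document when every document in the
    group is ranked by its similarity to the response."""
    scores = [score_fn(response, doc) for doc in group_docs]
    target_score = scores[target_idx]
    # Rank the target conservatively: ties with other documents count
    # against it, so only a truly distinctive response earns reward 1.0.
    rank = 1 + sum(
        1 for i, s in enumerate(scores) if i != target_idx and s >= target_score
    )
    return 1.0 / rank


if __name__ == "__main__":
    # Toy token-overlap scorer standing in for a real retriever.
    def overlap(a: str, b: str) -> float:
        return float(len(set(a.split()) & set(b.split())))

    docs = [
        "apple pie recipe with cinnamon",
        "apple pie recipe with nutmeg",
        "apple tart recipe with honey",
    ]
    # A distinctive response for doc 1 mentions its unique ingredient
    # and therefore ranks doc 1 first, yielding the maximum reward.
    print(group_wise_reciprocal_rank("pie with nutmeg", docs, 1, overlap))  # 1.0
```

Under this reading, the reward is maximized only when the response retrieves its source document ahead of all near-duplicates in the group, which is exactly the distinctiveness signal PPO then optimizes for.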