Modern Language Models (LMs) can follow long and complex instructions, which lets them serve a large and diverse set of user requests. Although Information Retrieval (IR) models use these LMs as the backbone of their architectures, virtually none of them allow users to provide detailed instructions alongside queries, which limits their ability to satisfy complex information needs. In this work, we study the use of instructions in IR systems. First, we introduce our dataset FollowIR, which contains a rigorous instruction evaluation benchmark as well as a training set for helping IR models learn to better follow real-world instructions. FollowIR repurposes detailed instructions (also known as narratives) that were developed for professional assessors evaluating retrieval systems. In particular, we build our benchmark from three collections curated for shared tasks at the Text REtrieval Conference (TREC). These collections contain hundreds to thousands of labeled documents per query, making them suitable for our exploration. This lets us measure how well IR models follow instructions using a new pairwise evaluation framework. Our results indicate that existing retrieval models fail to use instructions correctly: they treat them as a source of basic keywords and struggle to understand long-form information. However, we show that IR models can learn to follow complex instructions: our new FollowIR-7B model shows significant improvements after fine-tuning on our training set.