Modern Language Models (LMs) can follow long and complex instructions, which lets them serve a large and diverse set of user requests. Although Information Retrieval (IR) models use these LMs as the backbone of their architectures, virtually none of them allow users to provide detailed instructions alongside queries, which limits their ability to satisfy complex information needs. In this work, we study the use of instructions in IR systems. First, we introduce our dataset FollowIR, which contains a rigorous instruction evaluation benchmark as well as a training set for helping IR models learn to better follow real-world instructions. FollowIR repurposes the detailed instructions (also known as narratives) that were written for professional assessors to evaluate retrieval systems. In particular, we build our benchmark from three collections curated for shared tasks at the Text REtrieval Conference (TREC). These collections contain hundreds to thousands of labeled documents per query, making them suitable for our exploration. Using a new pairwise evaluation framework, we can then measure how well IR models follow instructions. Our results indicate that existing retrieval models fail to use instructions correctly: they treat them only as a source of basic keywords and struggle to understand long-form information. However, we show that IR models can learn to follow complex instructions: our new FollowIR-7B model improves significantly after fine-tuning on our training set.