Modern Large Language Models (LLMs) are capable of following long and complex instructions that enable a diverse range of user tasks. However, despite Information Retrieval (IR) models adopting LLMs as the backbone of their architectures, nearly all of them still take only queries as input, with no instructions. For the handful of recent models that do take instructions, it is unclear how they use them. We introduce FollowIR, a dataset that contains a rigorous instruction-evaluation benchmark as well as a training set for helping IR models learn to better follow real-world instructions. FollowIR builds on the long history of the TREC conferences: just as TREC provides human annotators with instructions (also known as narratives) to determine document relevance, so should IR models be able to understand and decide relevance based on these detailed instructions. Our evaluation benchmark starts with three deeply judged TREC collections and alters the annotator instructions, re-annotating the relevant documents. This process lets us measure how well IR models follow instructions through a new pairwise evaluation framework. Our results indicate that existing retrieval models fail to use instructions correctly, treating them as sources of basic keywords and struggling to understand long-form information. However, we show that it is possible for IR models to learn to follow complex instructions: our new FollowIR-7B model shows significant improvements (over 13%) after fine-tuning on our training set.
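To make the pairwise evaluation idea concrete, here is a minimal Python sketch of one possible check, not the paper's actual metric: a model that truly follows instructions should demote a document once an altered instruction marks it non-relevant. The function names (`pairwise_score`, `rank_of`) and the toy document IDs are hypothetical, introduced only for illustration.

```python
# Hypothetical sketch of a pairwise instruction-following check (an
# assumption for illustration, not the paper's exact evaluation metric).

def rank_of(doc_id: str, ranking: list[str]) -> int:
    """1-indexed rank of doc_id in a ranked list of document IDs."""
    return ranking.index(doc_id) + 1

def pairwise_score(ranking_orig: list[str],
                   ranking_altered: list[str],
                   flipped_docs: list[str]) -> float:
    """Fraction of documents (relevant under the original instruction but
    not under the altered one) that the model ranks lower after the change.
    Higher is better: 1.0 means every flipped document was demoted."""
    demoted = sum(
        rank_of(d, ranking_altered) > rank_of(d, ranking_orig)
        for d in flipped_docs
    )
    return demoted / len(flipped_docs)

# Toy example: doc "d2" is no longer relevant under the altered instruction.
orig    = ["d2", "d1", "d3"]   # ranking under the original instruction
altered = ["d1", "d3", "d2"]   # ranking under the altered instruction
print(pairwise_score(orig, altered, ["d2"]))  # 1.0: d2 was correctly demoted
```

Comparing each document's rank across the two instruction variants, rather than scoring either ranking in isolation, isolates instruction-following ability from raw retrieval quality.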