Large language models (LLMs) have demonstrated impressive capabilities in various natural language processing tasks. Despite this, their application to information retrieval (IR) tasks remains challenging because many IR-specific concepts occur infrequently in natural language. While prompt-based methods can provide task descriptions to LLMs, they often fall short in facilitating a comprehensive understanding and execution of IR tasks, thereby limiting the applicability of LLMs. To address this gap, we explore the potential of instruction tuning to enhance LLMs' proficiency in IR tasks. We introduce a novel instruction tuning dataset, INTERS, encompassing 20 tasks across three fundamental IR categories: query understanding, document understanding, and query-document relationship understanding. The data are derived from 43 distinct datasets with manually written templates. Our empirical results reveal that INTERS significantly boosts the performance of various publicly available LLMs, such as LLaMA, Mistral, and Phi, on IR tasks. Furthermore, we conduct extensive experiments to analyze the effects of instruction design, template diversity, few-shot demonstrations, and the volume of instructions on performance. We make our dataset and the fine-tuned models publicly accessible at https://github.com/DaoD/INTERS.