Large language models (LLMs) have demonstrated impressive capabilities across a wide range of natural language processing tasks. Applying them to information retrieval (IR) tasks nevertheless remains challenging, because many IR-specific concepts occur only rarely in natural language. Prompt-based methods can supply task descriptions to LLMs, but they often fail to give the model a comprehensive understanding of IR tasks or the ability to execute them, which limits LLMs' applicability. To address this gap, we explore the potential of instruction tuning to enhance LLMs' proficiency in IR tasks. We introduce a novel instruction tuning dataset, INTERS, encompassing 20 tasks across three fundamental IR categories: query understanding, document understanding, and query-document relationship understanding. The data are derived from 43 distinct datasets using manually written templates. Our empirical results show that INTERS significantly boosts the performance of various publicly available LLMs, such as LLaMA, Mistral, and Phi, on IR tasks. We further conduct extensive experiments to analyze how instruction design, template diversity, few-shot demonstrations, and the volume of instructions affect performance. Our dataset and fine-tuned models are publicly available at https://github.com/DaoD/INTERS.