Large language models (LLMs) have demonstrated impressive capabilities in various natural language processing tasks. Despite this, their application to information retrieval (IR) tasks remains challenging because many IR-specific concepts occur only infrequently in natural language. While prompt-based methods can provide task descriptions to LLMs, they often fall short in facilitating comprehensive understanding and execution of IR tasks, thereby limiting LLMs' applicability. To address this gap, we explore the potential of instruction tuning to enhance LLMs' proficiency in IR tasks. We introduce a novel instruction tuning dataset, INTERS, encompassing 21 tasks across three fundamental IR categories: query understanding, document understanding, and query-document relationship understanding. The data are derived from 43 distinct datasets with manually written templates. Our empirical results reveal that INTERS significantly boosts the performance of various publicly available LLMs, such as LLaMA, Mistral, and Phi, in search-related tasks. Furthermore, we conduct a comprehensive analysis of the effects of base model selection, instruction design, volume of instructions, and task variety on performance. We make our dataset and the models fine-tuned on it publicly accessible at https://github.com/DaoD/INTERS.