Instruction-following capabilities in large language models (LLMs) have progressed significantly, enabling more complex user interactions through detailed prompts. However, retrieval systems have not matched these advances; most still rely on traditional lexical and semantic matching techniques that fail to fully capture user intent. Recent efforts have introduced instruction-aware retrieval models, but these focus primarily on intrinsic content relevance and neglect the importance of customized preferences for broader document-level attributes. This study evaluates the instruction-following capabilities of various retrieval models beyond content relevance, including LLM-based dense retrieval and reranking models. We develop InfoSearch, a novel retrieval evaluation benchmark spanning six document-level attributes: Audience, Keyword, Format, Language, Length, and Source, and introduce two novel metrics -- Strict Instruction Compliance Ratio (SICR) and Weighted Instruction Sensitivity Evaluation (WISE) -- to accurately assess the models' responsiveness to instructions. Our findings reveal that while reranking models generally surpass retrieval models in instruction following, they still struggle with certain attributes. Moreover, although instruction fine-tuning and increased model size lead to better performance, most models fall short of comprehensive instruction compliance as assessed by our benchmark.