Instruction-following capabilities in LLMs have progressed significantly, enabling more complex user interactions through detailed prompts. However, retrieval systems have not matched these advances; most still rely on traditional lexical and semantic matching techniques that fail to fully capture user intent. Recent efforts have introduced instruction-aware retrieval models, but these primarily focus on intrinsic content relevance and neglect customized preferences over broader document-level attributes. This study evaluates the instruction-following capabilities of various retrieval models beyond content relevance, including LLM-based dense retrieval and reranking models. We develop InfoSearch, a novel retrieval evaluation benchmark spanning six document-level attributes: Audience, Keyword, Format, Language, Length, and Source, and introduce two novel metrics -- the Strict Instruction Compliance Ratio (SICR) and the Weighted Instruction Sensitivity Evaluation (WISE) -- to accurately assess the models' responsiveness to instructions. Our findings indicate that although fine-tuning models on instruction-aware retrieval datasets and increasing model size improve performance, most models still fall short of full instruction compliance.
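To make the notion of document-level instruction compliance concrete, the sketch below shows a minimal, hypothetical way to check whether a ranked list respects an attribute-style instruction (e.g., a Length constraint). It is not the paper's SICR or WISE definition; the function name, data layout, and the top-k all-or-nothing criterion are illustrative assumptions only.

```python
# Toy illustration (NOT the paper's SICR/WISE metrics): given a ranked list
# per query and a predicate encoding a document-level instruction (e.g.,
# "length under 200 words"), compute the fraction of queries whose top-k
# retrieved documents all satisfy the instruction.

from typing import Callable, Dict, List


def naive_compliance_ratio(
    rankings: Dict[str, List[str]],     # query id -> ranked list of doc ids
    docs: Dict[str, dict],              # doc id -> document fields
    satisfies: Callable[[dict], bool],  # instruction predicate on a document
    k: int = 5,
) -> float:
    """Fraction of queries whose top-k documents all satisfy the instruction."""
    compliant = 0
    for qid, ranked in rankings.items():
        top_k = [docs[doc_id] for doc_id in ranked[:k]]
        if top_k and all(satisfies(doc) for doc in top_k):
            compliant += 1
    return compliant / max(len(rankings), 1)


if __name__ == "__main__":
    # Hypothetical corpus annotated with word counts for a "Length" instruction.
    docs = {
        "d1": {"text": "short answer", "n_words": 120},
        "d2": {"text": "long survey", "n_words": 4500},
    }
    rankings = {"q1": ["d1", "d2"], "q2": ["d2", "d1"]}
    ratio = naive_compliance_ratio(rankings, docs, lambda d: d["n_words"] < 200, k=1)
    print(f"naive compliance ratio: {ratio:.2f}")  # prints 0.50
```

In contrast to this naive ratio, the paper's SICR and WISE metrics are designed to measure how a model's ranking changes in response to instructions, rather than only whether the final top results happen to satisfy them.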