Retrieval-augmented generation (RAG) is typically optimized for topical relevance, yet its success ultimately depends on whether retrieved passages are useful for a large language model (LLM) to generate correct and complete answers. We argue that such utility is often LLM-specific rather than universal, due to differences in models' knowledge, reasoning, and ability to leverage evidence. We formalize LLM-specific utility as the performance improvement of a target LLM when a passage is provided, compared to answering without evidence. To systematically study LLM-specific utility, we construct a benchmark of LLM-specific gold utilitarian passages for four LLMs (Qwen3-8B/14B/32B and Llama3.1-8B) on three QA datasets (Natural Questions, TriviaQA, and MS MARCO-FQA). Our analysis shows that utilitarian passages are model-dependent and non-transferable: each LLM performs best with its own utilitarian evidence, while evidence optimized for other LLMs is consistently suboptimal. Human-annotated evidence remains a strong general baseline but does not fully match individual LLM utility needs. We further introduce the LLM-specific utility judgment task and find that existing utility-aware selection and scoring methods largely capture model-agnostic usefulness and struggle to reliably estimate LLM-specific utility. Overall, our findings highlight the limitations of current utility-aware retrieval and motivate generator-tailored evidence selection for improving RAG.
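The formalization above — utility as the performance gain a target LLM achieves when a passage is provided versus closed-book answering — can be sketched as follows. This is a minimal illustration, not the paper's implementation; `answer_fn`, `score_fn`, and the toy model and exact-match scorer are hypothetical stand-ins.

```python
def utility(answer_fn, score_fn, question, passage, reference):
    """LLM-specific utility of `passage` for one model on one question:
    score(M(q | p)) - score(M(q)), i.e. the gain from seeing the evidence."""
    closed_book = score_fn(answer_fn(question, None), reference)
    with_passage = score_fn(answer_fn(question, passage), reference)
    return with_passage - closed_book

# Toy stand-ins (illustrative assumptions): a "model" that only answers
# correctly when the evidence contains the answer, and exact-match scoring.
def toy_model(question, passage):
    return "Paris" if passage and "Paris" in passage else "unsure"

def exact_match(prediction, reference):
    return 1.0 if prediction == reference else 0.0

gain = utility(toy_model, exact_match,
               "What is the capital of France?",
               "Paris is the capital of France.",
               "Paris")
# For this toy model the passage lifts the score from 0.0 to 1.0,
# so its utility is 1.0; for a model that already knows the answer
# closed-book, the same passage would have utility 0.0 — the
# model-dependence the abstract argues for.
```

Under this definition, a passage's utility is relative to the generator: the same evidence can be essential for one LLM and redundant for another, which is why gold utilitarian passages are non-transferable across models.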