PosIR: Position-Aware Heterogeneous Information Retrieval Benchmark

In real-world documents, the information relevant to a user query may reside anywhere from the beginning to the end. This makes position bias -- a systematic tendency of retrieval models to favor or neglect content based on its location -- a critical concern. Although recent studies have identified such bias, existing analyses focus predominantly on English, fail to disentangle document length from information position, and lack a standardized framework for systematic diagnosis. To address these limitations, we introduce PosIR (Position-Aware Information Retrieval), the first standardized benchmark designed to systematically diagnose position bias in diverse retrieval scenarios. PosIR comprises 310 datasets spanning 10 languages and 31 domains, with relevance tied to precise reference spans. At its methodological core, PosIR employs a length-controlled bucketing strategy that groups queries by positive document length and analyzes positional effects within each bucket. This design strictly isolates position bias from length-induced performance degradation. Extensive experiments on 10 state-of-the-art embedding-based retrieval models reveal that: (1) retrieval performance on PosIR with documents exceeding 1536 tokens correlates poorly with the MMTEB benchmark, exposing limitations of current short-text evaluations; (2) position bias is pervasive in embedding models and even increases with document length, with most models exhibiting primacy bias while certain models show unexpected recency bias; (3) as an exploratory investigation, gradient-based saliency analysis further uncovers two distinct internal mechanisms that correlate with these positional preferences. We hope that PosIR can serve as a valuable diagnostic framework to advance the development of position-robust retrieval systems.
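The length-controlled bucketing idea can be illustrated with a minimal sketch. It assumes each query comes with the token length of its positive document, the token offsets of the reference span, and a per-query retrieval score. The bucket boundaries, the number of position bins, the record keys, and all function names below are hypothetical choices for illustration, not PosIR's actual configuration or code.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import mean

# Illustrative assumptions, not PosIR's actual configuration.
LENGTH_EDGES = [512, 1024, 1536]   # token-length boundaries between buckets
NUM_POSITION_BINS = 3              # e.g. beginning / middle / end


def length_bucket(doc_len):
    """Index of the length bucket that doc_len (in tokens) falls into."""
    return bisect_right(LENGTH_EDGES, doc_len)


def position_bin(span_start, span_end, doc_len):
    """Bin of the reference span's midpoint, relative to document length."""
    rel = (span_start + span_end) / 2 / doc_len
    return min(int(rel * NUM_POSITION_BINS), NUM_POSITION_BINS - 1)


def scores_by_length_and_position(records):
    """Mean score per (length bucket, position bin).

    Each record is a dict with the hypothetical keys
    doc_len, span_start, span_end, and score.
    """
    groups = defaultdict(list)
    for r in records:
        key = (
            length_bucket(r["doc_len"]),
            position_bin(r["span_start"], r["span_end"], r["doc_len"]),
        )
        groups[key].append(r["score"])
    return {key: mean(vals) for key, vals in sorted(groups.items())}


# Toy records with made-up scores, only to exercise the grouping logic.
records = [
    {"doc_len": 300, "span_start": 0, "span_end": 60, "score": 0.9},
    {"doc_len": 300, "span_start": 240, "span_end": 300, "score": 0.7},
    {"doc_len": 2000, "span_start": 0, "span_end": 200, "score": 0.8},
    {"doc_len": 2000, "span_start": 1800, "span_end": 2000, "score": 0.4},
]
table = scores_by_length_and_position(records)
```

Comparing position bins only within the same length bucket holds document length roughly fixed, so a gap between the first and last bin in a given bucket reflects position rather than length. Comparing that gap across buckets then shows whether the positional effect changes as documents grow longer.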
