Information retrieval (IR) evaluation remains challenging due to incomplete IR benchmark datasets that contain unlabeled relevant chunks. While LLMs and LLM-human hybrid strategies reduce costly human effort, they remain prone to LLM overconfidence and ineffective AI-to-human escalation. To address this, we propose DREAM, a multi-round debate-based relevance assessment framework with LLM agents, built on opposing initial stances and iterative reciprocal critique. Through agreement-based debate, DREAM yields more accurate labels for confident cases and more reliable AI-to-human escalation for uncertain ones, achieving 95.2% labeling accuracy with only 3.5% human involvement. Using DREAM, we build BRIDGE, a refined benchmark that mitigates evaluation bias and enables fairer retriever comparison by uncovering 29,824 missing relevant chunks. We then re-benchmark IR systems and extend evaluation to RAG, showing that unaddressed holes not only distort retriever rankings but also drive retrieval-generation misalignment. The relevance assessment framework is available at https://github.com/DISL-Lab/DREAM-ICLR-26, and the BRIDGE dataset is available at https://github.com/DISL-Lab/BRIDGE-Benchmark.
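The core loop of an agreement-based debate can be sketched as follows. This is a minimal, hypothetical illustration of the idea described in the abstract, not the authors' implementation: the agent functions, the round limit, and the escalation rule are all illustrative assumptions.

```python
# Hypothetical sketch: two LLM agents with opposing initial stances debate a
# (query, chunk) pair; agreement yields a label, persistent disagreement
# escalates to a human annotator. Agent interfaces are assumed for illustration.

def debate_relevance(agent_pro, agent_con, query, chunk, max_rounds=3):
    """Run a multi-round debate; return a label on agreement, else escalate."""
    transcript = []  # shared record of prior arguments, visible to both agents
    for _ in range(max_rounds):
        pro = agent_pro(query, chunk, transcript)  # initially argues "relevant"
        con = agent_con(query, chunk, transcript)  # initially argues "not relevant"
        transcript.extend([pro, con])
        if pro["label"] == con["label"]:
            # Agreement after reciprocal critique -> confident automatic label
            return {"label": pro["label"], "escalate": False}
    # No agreement within the round budget -> reliable AI-to-human escalation
    return {"label": None, "escalate": True}

# Mock agents standing in for real LLM calls: the "con" agent concedes after
# seeing one round of counter-arguments in the transcript.
def pro_agent(query, chunk, transcript):
    return {"label": "relevant", "argument": "the chunk answers the query"}

def con_agent(query, chunk, transcript):
    label = "relevant" if transcript else "not relevant"
    return {"label": label, "argument": "reconsidered after critique"}
```

With these mocks, the agents disagree in round one, then converge in round two, so the pair is labeled automatically rather than escalated; only pairs where the agents never converge would reach a human.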