Mitigating Preference Leakage via Strict Estimator Separation for Normative Generative Ranking

In Generative Information Retrieval (GenIR), the bottleneck has shifted from generation to candidate selection, particularly for normative criteria such as cultural relevance. Current LLM-as-a-Judge evaluations often suffer from circularity and preference leakage, where overlap between the supervision and evaluation models inflates measured performance. We address this by formalising cultural relevance as a within-query ranking task and introducing a leakage-free two-judge framework that strictly separates supervision (Judge B) from evaluation (Judge A). On NGR-33k, a new benchmark of 33,052 culturally grounded stories, we find that classical baselines yield only modest gains, whereas a dense bi-encoder distilled from a Judge-B-supervised Cross-Encoder is highly effective. Although the Cross-Encoder provides a strong supervision signal for distillation, the distilled BGE-M3 model substantially outperforms it under leakage-free Judge A evaluation. We validate our framework on the human-curated Moral Stories dataset, showing strong alignment with human norms. Our results demonstrate that rigorous evaluator separation is a prerequisite for credible GenIR evaluation and show that subtle cultural preferences can be distilled into efficient rankers without leakage.
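As a rough illustration of the two-judge idea, the sketch below scores a ranker within each query using only Judge A labels, after checking that the supervision judge is a different model. It is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not name the ranking metric, so nDCG@k is assumed here, and the function names, judge identifiers, and toy rows are all hypothetical.

```python
import math
from collections import defaultdict


def check_judge_separation(supervision_judge, evaluation_judge):
    """Refuse to evaluate when the same model supplies training and test labels.

    The string identifiers are hypothetical; any stable model ID would do.
    """
    if supervision_judge == evaluation_judge:
        raise ValueError("supervision and evaluation judges must differ")


def dcg(gains):
    """Discounted cumulative gain of a list of gains in ranked order."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(scores, labels, k):
    """nDCG@k for one query: rank by ranker score, use judge labels as gains."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order][:k]
    ideal_dcg = dcg(sorted(labels, reverse=True)[:k])
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


def evaluate_within_query(rows, k=10):
    """Mean nDCG@k over queries.

    rows: iterable of (query_id, ranker_score, judge_a_label) tuples.
    Candidates are only ever compared with others from the same query.
    """
    by_query = defaultdict(list)
    for query_id, score, label in rows:
        by_query[query_id].append((score, label))
    per_query = [
        ndcg_at_k([s for s, _ in items], [l for _, l in items], k)
        for items in by_query.values()
    ]
    return sum(per_query) / len(per_query)


# Toy usage: the ranker was trained on Judge B labels (not shown),
# and is scored here against Judge A labels only.
check_judge_separation("judge-b", "judge-a")
toy_rows = [
    ("q1", 0.9, 3), ("q1", 0.1, 0),
    ("q2", 0.2, 1), ("q2", 0.8, 2),
]
mean_ndcg = evaluate_within_query(toy_rows, k=2)
```

Grouping by query before scoring keeps the metric a within-query comparison, so label scales that differ across queries do not distort the result.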

