In Generative Information Retrieval (GenIR), the bottleneck has shifted from generation to candidate selection, particularly for normative criteria such as cultural relevance. Current LLM-as-a-Judge evaluations often suffer from circularity and preference leakage: when the supervision and evaluation models overlap, measured performance is inflated. We address this by formalising cultural relevance as a within-query ranking task and introducing a leakage-free two-judge framework that strictly separates supervision (Judge B) from evaluation (Judge A). On NGR-33k, a new benchmark of 33,052 culturally grounded stories, we find that while classical baselines yield only modest gains, a dense bi-encoder distilled from a Judge-B-supervised Cross-Encoder is highly effective. Although the Cross-Encoder provides a strong supervision signal for distillation, the distilled BGE-M3 model substantially outperforms it under leakage-free Judge A evaluation. We validate our framework on the human-curated Moral Stories dataset, showing strong alignment with human norms. Our results indicate that rigorous evaluator separation is a prerequisite for credible GenIR evaluation, and show that subtle cultural preferences can be distilled into efficient rankers without leakage.
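The separation between the two judges in a within-query ranking setup can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the evaluation metric, so pairwise agreement is assumed here purely for illustration, and the field names `ranker`, `judge_a`, and `judge_b` are hypothetical. The point is only that the evaluation path reads Judge A labels and never touches Judge B labels, which are reserved for supervision.

```python
from itertools import combinations


def pairwise_accuracy(ranker_scores, judge_scores):
    """Fraction of candidate pairs within one query that the ranker orders
    the same way as the judge. Pairs the judge scores equally are skipped.
    Returns None if the query has no comparable pair."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(judge_scores)), 2):
        judge_diff = judge_scores[i] - judge_scores[j]
        if judge_diff == 0:
            continue
        total += 1
        if (ranker_scores[i] - ranker_scores[j]) * judge_diff > 0:
            agree += 1
    return agree / total if total else None


def evaluate(queries):
    """Mean within-query pairwise accuracy, computed against Judge A only.
    The hypothetical "judge_b" field is deliberately never read here."""
    per_query = []
    for q in queries:
        acc = pairwise_accuracy(q["ranker"], q["judge_a"])
        if acc is not None:
            per_query.append(acc)
    return sum(per_query) / len(per_query) if per_query else None


# Toy data (made up for illustration): each query holds candidate stories
# with ranker scores, Judge A labels (evaluation), Judge B labels (training).
queries = [
    {"ranker": [0.9, 0.2, 0.5], "judge_a": [3, 1, 2], "judge_b": [3, 2, 1]},
    {"ranker": [0.1, 0.8], "judge_a": [2, 1], "judge_b": [1, 2]},
]

print(evaluate(queries))
```

Because candidates are compared only with others from the same query, scores never need to be calibrated across queries; this is the sense in which the task is a within-query ranking problem.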