In Generative Information Retrieval (GenIR), the bottleneck has shifted from generation to candidate selection, particularly for normative criteria such as cultural relevance. Current LLM-as-a-Judge evaluations often suffer from circularity and preference leakage: when the supervision and evaluation models overlap, measured performance is inflated. We address this by formalising cultural relevance as a within-query ranking task and introducing a leakage-free two-judge framework that strictly separates supervision (Judge B) from evaluation (Judge A). On NGR-33k, a new benchmark of 33,052 culturally grounded stories, we find that classical baselines yield only modest gains, whereas a dense bi-encoder distilled from a Judge-B-supervised Cross-Encoder is highly effective. Although the Cross-Encoder provides a strong supervision signal for distillation, the distilled BGE-M3 model substantially outperforms it under leakage-free Judge A evaluation. We also validate the framework on the human-curated Moral Stories dataset, where it shows strong alignment with human norms. Our results indicate that rigorous evaluator separation is a prerequisite for credible GenIR evaluation, and show that subtle cultural preferences can be distilled into efficient rankers without leakage.
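To make the two-judge separation concrete, the following is a minimal sketch of a within-query ranking evaluation in which the ranker's scores are compared only against gains assigned by the evaluation judge. The abstract does not state which ranking metric is used, so NDCG@k here is an assumption, as are the function names, the record layout, and the judge identifiers `"judge_a"` and `"judge_b"`.

```python
import math
from collections import defaultdict


def ndcg_at_k(gains_in_ranked_order, k):
    """NDCG@k for one query, given gains listed in the ranker's order."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

    ideal = dcg(sorted(gains_in_ranked_order, reverse=True))
    return dcg(gains_in_ranked_order) / ideal if ideal > 0 else 0.0


def evaluate_within_query(records, supervision_judge, evaluation_judge, k=3):
    """Mean NDCG@k over queries, scored only with the evaluation judge.

    records: iterable of (query_id, ranker_score, evaluation_judge_gain).
    The judge identifiers are hypothetical labels; the check below only
    enforces that supervision and evaluation come from different judges.
    """
    if supervision_judge == evaluation_judge:
        raise ValueError("leakage: supervision and evaluation judges must differ")

    # Group candidates by query so ranking is assessed within each query.
    by_query = defaultdict(list)
    for query_id, ranker_score, gain in records:
        by_query[query_id].append((ranker_score, gain))
    if not by_query:
        raise ValueError("no records to evaluate")

    per_query = []
    for candidates in by_query.values():
        ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
        per_query.append(ndcg_at_k([gain for _, gain in ranked], k))
    return sum(per_query) / len(per_query)


# Toy data (invented for illustration, not taken from NGR-33k).
records = [
    ("q1", 0.9, 3.0), ("q1", 0.2, 1.0), ("q1", 0.5, 2.0),
    ("q2", 0.1, 2.0), ("q2", 0.8, 0.0),
]
score = evaluate_within_query(records, "judge_b", "judge_a", k=3)
```

In this toy example the ranker orders `q1` perfectly and inverts `q2`, so the mean falls below 1. Passing the same identifier for both judges raises an error, which is the simplest possible guard against the circular setup the abstract warns about.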