Dense retrieval models are typically fine-tuned with contrastive learning objectives that require binary relevance judgments, even though relevance is inherently graded. We analyze how graded relevance scores and the threshold used to convert them into binary labels affect multilingual dense retrieval. Using a multilingual dataset with LLM-annotated relevance scores, we examine monolingual, multilingual mixture, and cross-lingual retrieval scenarios. Our findings show that the optimal threshold varies systematically across languages and tasks, often reflecting differences in resource level. A well-chosen threshold can improve effectiveness, reduce the amount of fine-tuning data required, and mitigate annotation noise, whereas a poorly chosen one can degrade performance. We argue that graded relevance is a valuable but underutilized signal for dense retrieval, and that threshold calibration should be treated as a principled component of the fine-tuning pipeline.
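The core operation the abstract describes, converting graded relevance scores into binary labels at a threshold before contrastive fine-tuning, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the 0-3 score scale and the `binarize` helper are assumptions for the example.

```python
# Hypothetical sketch: binarize graded relevance scores at a threshold
# for contrastive fine-tuning. The 0-3 scale is an assumed example scale.
def binarize(scores, threshold):
    """Map graded relevance scores to binary labels: 1 if score >= threshold."""
    return [1 if s >= threshold else 0 for s in scores]

graded = [0, 1, 2, 3, 2, 1]  # e.g. LLM-annotated relevance on a 0-3 scale
print(binarize(graded, 2))   # stricter threshold: only clearly relevant pairs
print(binarize(graded, 1))   # looser threshold: more, noisier positives
```

Shifting the threshold trades off positive-set size against label noise, which is why the abstract argues the choice should be calibrated per language and task rather than fixed globally.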