Lowering the numerical precision of model parameters and computations is widely adopted to improve the efficiency of retrieval systems. However, when computing relevance scores between the query and documents at low precision, we observe spurious ties due to the reduced score granularity. The results then vary substantially with how ties are resolved, making evaluation less reliable. To address this, we propose a more robust retrieval evaluation protocol designed to reduce score variation. It consists of: (1) High-Precision Scoring (HPS), which upcasts the final scoring step to higher precision to resolve tied candidates at minimal computational cost; and (2) Tie-aware Retrieval Metrics (TRM), which report expected scores, range, and bias to quantify the order uncertainty of tied candidates. Our experiments evaluate multiple models with three scoring functions on two retrieval datasets, demonstrating that HPS dramatically reduces tie-induced instability and that TRM accurately recovers expected metric values. Together they enable a more consistent and reliable evaluation protocol for low-precision retrieval.
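The tie-breaking idea behind HPS can be illustrated with a minimal sketch: score candidates at low precision, observe that the coarse granularity collapses distinct scores into ties, then recompute only the final scoring step at higher precision. The array shapes, the dot-product scoring function, and the use of float16/float32 here are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
query = rng.standard_normal(64).astype(np.float16)
docs = rng.standard_normal((1000, 64)).astype(np.float16)

# Low-precision relevance scores: float16's coarse granularity can map
# distinct query-document similarities onto the same value (spurious ties).
low_scores = (docs @ query).astype(np.float16)
n_ties_low = len(low_scores) - len(np.unique(low_scores))

# HPS-style fix: upcast only this final scoring step to float32. The bulk of
# the pipeline (encoding, index storage) stays low-precision, so the extra
# cost is limited to one higher-precision matrix-vector product.
high_scores = docs.astype(np.float32) @ query.astype(np.float32)
n_ties_high = len(high_scores) - len(np.unique(high_scores))

# Ranking by high_scores now depends far less on arbitrary tie resolution.
ranking = np.argsort(-high_scores)
```

The design point is that only the scoring step, not the stored embeddings, is upcast, which keeps the memory and throughput benefits of the low-precision index.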