A Meta-Evaluation of C/W/L/A Metrics: System Ranking Similarity, System Ranking Consistency and Discriminative Power

Recently, Moffat et al. proposed an analytic framework, C/W/L/A, for offline evaluation metrics. This framework allows information retrieval (IR) researchers to design evaluation metrics by flexibly combining user browsing models with user gain aggregations. However, the statistical stability of C/W/L/A metrics under different aggregations has not yet been investigated. In this study, we investigate it from three perspectives: (1) the system ranking similarity among aggregations, (2) the system ranking consistency of aggregations and (3) the discriminative power of aggregations. More specifically, we combine various aggregation functions with the browsing models of Precision, Discounted Cumulative Gain (DCG), Rank-Biased Precision (RBP), INST, Average Precision (AP) and Expected Reciprocal Rank (ERR), and examine their performance on two offline test collections in terms of system ranking similarity, system ranking consistency and discriminative power. Our experimental results suggest that, in terms of system ranking consistency and discriminative power, the expected rate of gain (ERG) aggregation performs outstandingly, while the maximum relevance aggregation usually performs poorly. The results also suggest that Precision, DCG, RBP, INST and AP with their canonical aggregations all perform favourably in system ranking consistency and discriminative power; for ERR, however, replacing its canonical aggregation with ERG further strengthens the discriminative power while still producing a system ranking similar to that of the canonical version.
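To make the idea of pairing a browsing model with an aggregation concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a browsing model can be given as a list of per-rank continuation probabilities over a non-empty, finite ranking, with the user forced to stop at the end of the list. All function names are hypothetical. The `expected_max_gain` function is only one plausible reading of a maximum relevance aggregation, and the RBP weights here are normalised over the truncated list rather than an infinite ranking.

```python
from itertools import accumulate


def view_probs(cont):
    """Probability that the user inspects each rank, given continuation probabilities."""
    probs, v = [], 1.0
    for c in cont:
        probs.append(v)
        v *= c
    return probs


def stop_probs(cont):
    """Probability that each rank is the last one inspected.

    Assumption: the user always stops at the final rank of the list.
    """
    forced = list(cont[:-1]) + [0.0]
    return [v * (1.0 - c) for v, c in zip(view_probs(cont), forced)]


def erg(gains, cont):
    """Expected rate of gain: gains weighted by the normalised view probabilities."""
    v = view_probs(cont)
    total = sum(v)
    return sum(vi / total * g for vi, g in zip(v, gains))


def expected_max_gain(gains, cont):
    """Assumed maximum-style aggregation: expected best gain seen before stopping."""
    running_max = accumulate(gains, max)
    return sum(l * m for l, m in zip(stop_probs(cont), running_max))


def precision_cont(k, n):
    """Precision@k browsing model: always continue until rank k, then stop."""
    return [1.0 if i < k else 0.0 for i in range(1, n + 1)]


def rbp_cont(p, n):
    """RBP browsing model: continue with constant probability p at every rank."""
    return [p] * n


gains = [1.0, 0.0, 1.0, 0.0, 0.0]
print(erg(gains, precision_cont(4, len(gains))))                # same browsing model,
print(expected_max_gain(gains, precision_cont(4, len(gains))))  # two aggregations
print(erg(gains, rbp_cont(0.8, len(gains))))                    # different browsing model
```

With the Precision@4 browsing model, `erg` reduces to ordinary precision at depth 4, which is a quick sanity check on the sketch; swapping in `expected_max_gain` keeps the browsing model fixed and changes only the aggregation.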
