It is tempting to assume that because effectiveness metrics are free to assign scores to search engine result pages (SERPs), they are similarly free in the relative order they impose on pairs of SERPs. In fact, that second freedom is, to a considerable degree, illusory. Once one SERP in a pair has been given a score by a metric, fundamental ordering constraints in many cases dictate that the score for the second SERP must be either not less than, or not greater than, the score assigned to the first. We refer to these fixed relationships as innate pairwise SERP orderings. Our first goal in this work is to describe and defend those pairwise SERP relationship constraints, and to tabulate their relative occurrence via both exhaustive and empirical experimentation. We then consider how to employ such innate pairwise relationships in IR experiments, leading to a proposal for a new measurement paradigm. Specifically, we argue that tables of results in which many different metrics are listed for champion versus challenger system comparisons should be avoided. Instead, a single metric should be argued for in principled terms, and any relationship identified by that metric should then be reinforced by assessing, via the innate relationships, whether other metrics (indeed, all other metrics) are likely to yield the same system-versus-system outcome.
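The abstract does not spell out the ordering constraints themselves, so the following is only a minimal sketch under stated assumptions, not the definition used in this work. It assumes binary relevance, equal-length SERPs, and a hypothetical dominance rule: SERP A is taken to be innately "not less than" SERP B when A has at least as many relevant documents as B in every prefix. Any metric that sums binary gains under non-negative, non-increasing rank weights respects that rule. The names `innate_order`, `tabulate`, `precision`, and `rbp` are illustrative.

```python
from itertools import accumulate, combinations, product


def innate_order(a, b):
    """Compare two equal-length binary relevance vectors.

    ASSUMPTION (not taken from the source text): a is innately ">=" b when
    every prefix of a holds at least as many relevant documents as the same
    prefix of b. Returns ">=", "<=", "==", or None when neither dominates.
    """
    ca, cb = list(accumulate(a)), list(accumulate(b))
    a_dom = all(x >= y for x, y in zip(ca, cb))
    b_dom = all(y >= x for x, y in zip(ca, cb))
    if a_dom and b_dom:
        return "=="
    if a_dom:
        return ">="
    if b_dom:
        return "<="
    return None


def precision(serp):
    """Fraction of relevant documents in the SERP (uniform rank weights)."""
    return sum(serp) / len(serp)


def rbp(serp, p=0.8):
    """Rank-biased precision with persistence p (geometric rank weights)."""
    return (1 - p) * sum(r * p**i for i, r in enumerate(serp))


def tabulate(depth):
    """Exhaustively count innately ordered pairs among all binary SERPs.

    Returns (ordered_pairs, total_pairs) over unordered pairs of distinct
    relevance vectors of the given depth.
    """
    serps = list(product((0, 1), repeat=depth))
    ordered = total = 0
    for a, b in combinations(serps, 2):
        total += 1
        if innate_order(a, b) is not None:
            ordered += 1
    return ordered, total
```

Under these assumptions, a pair such as `(1, 0, 0)` versus `(0, 1, 1)` has no innate order, and metrics are free to disagree on it: `precision` scores the second SERP higher, while `rbp` with `p=0.5` scores the first higher. For every pair where `innate_order` returns `">="` or `"<="`, both metrics in the sketch agree with that direction, which is the kind of cross-metric reinforcement the proposed paradigm relies on.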