It is tempting to assume that because effectiveness metrics are free to assign scores to search engine result pages (SERPs) as they see fit, there must be a similar degree of freedom in the relative order in which pairs of SERPs can be placed. In fact, that second freedom is to a considerable degree illusory. This is because once one SERP in a pair has been given a certain score by a metric, fundamental ordering constraints in many cases dictate that the score for the second SERP must be either not less than, or not greater than, the score assigned to the first. We refer to these fixed relationships as innate pairwise SERP orderings. Our first goal in this work is to describe and defend those pairwise SERP relationship constraints, and to tabulate their relative occurrence via both exhaustive and empirical experimentation. We then consider how to employ such innate pairwise relationships in IR experiments, leading to a proposal for a new measurement paradigm. Specifically, we argue that tables of results in which many different metrics are listed for champion versus challenger system comparisons should be avoided. Instead, a single metric should be argued for in principled terms, with any relationships identified by that metric then reinforced by an assessment of the innate relationships, to determine whether other metrics (indeed, all other metrics) are likely to yield the same system-vs-system outcome.
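To make the idea of a fixed pairwise relationship concrete, the following is a minimal sketch under stated assumptions, not the constraints defined in this work. It assumes binary relevance, two SERPs of equal length, and one particular ordering rule chosen here for illustration: SERP A is taken to be innately "not less than" SERP B if, at every depth k, A has at least as many relevant documents in its top k as B does. For any metric that is a weighted sum of per-rank relevance with non-negative, non-increasing weights, that rule does imply score(A) >= score(B). The names `innate_order`, `weighted_score`, `rbp_weights`, and `precision_weights` are hypothetical helpers introduced only for this example.

```python
from itertools import accumulate


def innate_order(a, b):
    """Compare two binary-relevance SERPs by prefix dominance (an assumed rule).

    Returns '>=' if a has at least as many relevant documents as b at every
    depth, '<=' for the reverse, '==' if both hold, and None if neither holds
    (that is, the pair has no fixed ordering under this rule).
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes SERPs of equal length")
    ca, cb = list(accumulate(a)), list(accumulate(b))
    a_ge = all(x >= y for x, y in zip(ca, cb))
    b_ge = all(y >= x for x, y in zip(ca, cb))
    if a_ge and b_ge:
        return "=="
    if a_ge:
        return ">="
    if b_ge:
        return "<="
    return None


def weighted_score(serp, weights):
    """Score a SERP as a weighted sum of its per-rank relevance values."""
    return sum(w * r for w, r in zip(weights, serp))


def rbp_weights(n, p=0.8):
    """Rank-biased precision weights (1 - p) * p**k for ranks k = 0..n-1."""
    return [(1 - p) * p ** k for k in range(n)]


def precision_weights(n, k):
    """Precision@k expressed as weights: 1/k for the top k ranks, 0 below."""
    return [1.0 / k if i < k else 0.0 for i in range(n)]


# A pair with a fixed relationship under the assumed rule.
a, b = [1, 0, 1, 0], [0, 1, 0, 1]
# A pair with no fixed relationship: different metrics may disagree.
c, d = [1, 0, 0, 0], [0, 1, 1, 0]

if __name__ == "__main__":
    print(innate_order(a, b), innate_order(c, d))
    for name, w in [("P@1", precision_weights(4, 1)),
                    ("P@3", precision_weights(4, 3)),
                    ("RBP", rbp_weights(4))]:
        print(name, weighted_score(c, w), weighted_score(d, w))
```

In this sketch the pair (a, b) is ordered the same way by every weighted-sum metric with non-increasing weights, whereas for the pair (c, d) precision at depth 1 prefers c and precision at depth 3 prefers d. Pairs of the second kind are where the choice of metric can change a system-vs-system outcome.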