The MS MARCO-passage dataset has been the main large-scale dataset open to the IR community, and over the years it has successfully fostered the development of novel neural retrieval models. However, two different corpora of MS MARCO are used in the literature: the official one and a second one, in which passages were augmented with titles, mostly due to the introduction of the Tevatron code base. The addition of titles leaks relevance information and breaks the original guidelines of the MS MARCO-passage dataset. In this work, we investigate the differences between the two corpora and demonstrate empirically that they make a significant difference when evaluating a new method. In other words, we show that if a paper does not properly report which version it uses, reproducing its results fairly is essentially impossible. Furthermore, given the current state of reviewing, where tracking state-of-the-art results is of great importance, having two different versions of a dataset is a serious problem. This paper therefore aims to draw attention to the issue so that researchers are aware of it and report their results appropriately.