Retrieval models are often evaluated on partially annotated datasets: each query is mapped to a few relevant texts, and the remaining corpus is assumed to be irrelevant. As a result, models that successfully retrieve false negatives are penalized in evaluation. Unfortunately, completely annotating all texts for every query is not resource-efficient. In this work, we show that using partially annotated datasets in evaluation can paint a distorted picture. We curate D-MERIT, a passage retrieval evaluation set from Wikipedia that aspires to contain all relevant passages for each query. Queries describe a group (e.g., ``journals about linguistics''), and relevant passages are evidence that entities belong to that group (e.g., a passage indicating that Language is a journal about linguistics). We show that evaluating on a dataset containing annotations for only a subset of the relevant passages can produce a misleading ranking of retrieval systems, and that as more relevant texts are included in the evaluation set, the rankings converge. We propose our dataset as an evaluation resource and our study as a recommendation for balancing resource efficiency and reliable evaluation when annotating evaluation sets for text retrieval.
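To make the false-negative penalty concrete, here is a minimal sketch, not code or data from the paper: all passage IDs, relevance judgments, and system runs below are invented for illustration. It shows how recall@k computed against partial annotations can flip the ranking of two hypothetical retrieval systems relative to the complete annotations.

```python
# Minimal sketch (hypothetical data) of how partial annotation distorts
# evaluation: a system that retrieves unannotated relevant passages
# ("false negatives") scores worse than one that only finds annotated ones.

def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of judged-relevant passages appearing in the top-k results."""
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)

# One query with four truly relevant passages (p1..p4), of which only
# p1 and p2 were annotated in the partial evaluation set.
full_qrels = {"p1", "p2", "p3", "p4"}
partial_qrels = {"p1", "p2"}

# System A retrieves the unannotated relevant passages; System B
# retrieves only the annotated ones plus irrelevant passages (x*).
system_a = ["p3", "p4", "p1", "x1", "x2"]
system_b = ["p1", "p2", "x1", "x2", "x3"]

for name, run in [("A", system_a), ("B", system_b)]:
    print(name,
          "partial:", recall_at_k(run, partial_qrels),
          "full:", recall_at_k(run, full_qrels))

# Under the partial qrels, B (1.0) beats A (0.5); under the full qrels,
# A (0.75) beats B (0.5) -- the ranking of the two systems flips.
```

The flip disappears as more of the truly relevant passages are added to the qrels, which is the convergence behavior the abstract describes.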