The accessibility of documents within a collection plays a pivotal role in Information Retrieval: it reflects how easily specific content can be located in that collection. Documents can be accessed in two distinct ways. The first is through a retrieval model, using keyword or other feature-based search; the second is by navigating the links associated with documents, where such links are available. Metrics such as PageRank, Hub, and Authority characterize the pathways through which documents can be discovered within the link network, while retrievability quantifies how easily a document can be found by a retrieval model. In this paper, we compare these two perspectives, PageRank and retrievability, as measures of the importance and discoverability of content in a corpus. Through empirical experiments on benchmark datasets, we demonstrate a subtle similarity between retrievability and PageRank, which is more evident on larger datasets.
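To make the two quantities being compared concrete, the following is a minimal sketch in Python and not the experimental pipeline of the paper. It assumes PageRank is computed by plain power iteration with a damping factor of 0.85, and that retrievability is the cumulative variant: the number of queries that return a document within a rank cutoff. The function names, the toy link graph, the query result lists, and the cutoff are all hypothetical and invented for illustration; the resulting correlation says nothing about the results reported on the benchmark datasets.

```python
from typing import Dict, List, Sequence


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iters: int = 100) -> Dict[str, float]:
    """Power-iteration PageRank; `links` maps each document to its out-links."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {d: 1.0 / n for d in nodes}
    for _ in range(iters):
        # Score held by documents without out-links is spread uniformly.
        dangling = sum(rank[d] for d in nodes if not links.get(d))
        new = {d: (1.0 - damping) / n + damping * dangling / n for d in nodes}
        for src in nodes:
            outs = links.get(src, [])
            for dst in outs:
                new[dst] += damping * rank[src] / len(outs)
        rank = new
    return rank


def retrievability(rankings: Sequence[Sequence[str]], docs: Sequence[str],
                   cutoff: int) -> Dict[str, int]:
    """Cumulative retrievability: count of queries ranking a document in the top `cutoff`."""
    score = {d: 0 for d in docs}
    for ranked in rankings:
        for d in ranked[:cutoff]:
            score[d] += 1
    return score


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (vx * vy) ** 0.5


# Hypothetical toy corpus: a link graph and the ranked output of three queries.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rankings = [["c", "a", "b"], ["c", "b", "d"], ["a", "c", "b"]]

docs = sorted(links)
pr = pagerank(links)
ret = retrievability(rankings, docs, cutoff=2)
rho = spearman([pr[d] for d in docs], [ret[d] for d in docs])
print(pr, ret, rho)
```

A rank correlation is used here only because the two scores live on different scales; whether the paper uses this or another measure of agreement is not stated in the abstract.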