There are settings in which reproducibility of ranked lists is desirable, such as when extracting a subset of an evolving document corpus for downstream research tasks, or in domains with high reproducibility expectations, such as patent retrieval and medical systematic reviews. However, because global term statistics change whenever documents are modified or added to a corpus, the results of queries using typical ranked retrieval models are not reproducible even for the parts of the corpus that have not changed. Thus, Boolean retrieval frequently remains the mechanism of choice in such settings. We present a hybrid retrieval system that combines Lucene for fast retrieval with a column-store-based retrieval system maintaining a versioned and time-stamped index. The latter component allows previously posed queries to be re-executed with the same ranked list as a result, and further allows time-travel queries over evolving collections, such as web archives, while maintaining the original ranking. Thus, retrieval results in evolving document collections are fully reproducible even when the document collections, and thus the term statistics, change.
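The idea of a versioned, time-stamped index can be illustrated with a minimal sketch. This is not the system described above: the class `VersionedIndex`, its methods, the integer logical clock, and the simple TF-IDF weighting are all assumptions made for illustration, an in-memory list of rows stands in for a column-store table, and the Lucene component is not modeled. The point it shows is that when every posting carries a validity interval, document counts and document frequencies can be recomputed as of any past timestamp, so an old query yields its old ranking.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class VersionedIndex:
    """Hypothetical toy index; each row is one posting with a validity interval."""
    # Row layout: [doc_id, term, tf, valid_from, valid_to]; valid_to is None while current.
    rows: list = field(default_factory=list)
    clock: int = 0  # logical timestamp, incremented on every change

    def add(self, doc_id, text):
        """Add a document, or replace its current version; return the new timestamp."""
        self.clock += 1
        for row in self.rows:
            if row[0] == doc_id and row[4] is None:
                row[4] = self.clock  # close the superseded version
        for term, tf in Counter(text.lower().split()).items():
            self.rows.append([doc_id, term, tf, self.clock, None])
        return self.clock

    def search(self, query, as_of=None):
        """Rank documents using only the postings that were valid at time `as_of`."""
        as_of = self.clock if as_of is None else as_of
        visible = [r for r in self.rows
                   if r[3] <= as_of and (r[4] is None or r[4] > as_of)]
        n_docs = len({r[0] for r in visible})
        # One visible row per (document, term), so counting rows gives document frequency.
        df = Counter(r[1] for r in visible)
        query_terms = set(query.lower().split())
        scores = Counter()
        for doc_id, term, tf, _, _ in visible:
            if term in query_terms:
                scores[doc_id] += tf * math.log(1 + n_docs / df[term])
        # Sort by descending score, breaking ties by document id for determinism.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


index = VersionedIndex()
index.add("d1", "patent retrieval system")
t2 = index.add("d2", "retrieval of medical reviews")
before = index.search("patent retrieval", as_of=t2)

# The collection evolves: a new document arrives and d1 is rewritten.
index.add("d3", "patent patent patent claims")
index.add("d1", "system description")

after = index.search("patent retrieval", as_of=t2)  # time-travel query
current = index.search("patent retrieval")          # query against the latest state
```

In this sketch, `after` is computed from exactly the postings that were valid at `t2`, so it matches `before`, while `current` reflects the changed term statistics and the rewritten document. Superseded rows are closed rather than deleted, which is what makes the earlier state recoverable.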