The ClueWeb22 dataset, which contains nearly 10 billion documents, was released in 2022 to support academic and industry research. The goal of this project was to build retrieval baselines for the English section of the "super head" part (category B) of this dataset. The research community can use these baselines to compare against their own systems and to generate data for training and evaluating new retrieval and ranking algorithms. This report covers the sparse and dense first-stage retrievers and the neural rerankers that were implemented for this dataset. These systems are available as a service on a Carnegie Mellon University cluster.