Passage ranking involves two stages, passage retrieval and passage re-ranking, both of which are important and challenging topics for academia and industry in the area of Information Retrieval (IR). However, commonly used datasets for passage ranking focus mostly on English. For non-English scenarios such as Chinese, existing datasets are limited in data scale and fine-grained relevance annotation, and they suffer from false negative issues. To address this problem, we introduce T2Ranking, a large-scale Chinese benchmark for passage ranking. T2Ranking comprises more than 300K queries and over 2M unique passages from real-world search engines. Expert annotators are recruited to provide 4-level graded (fine-grained) relevance scores for query-passage pairs instead of binary (coarse-grained) relevance judgments. To mitigate the false negative issues, more passages with greater diversity are considered during relevance annotation, especially in the test set, to ensure a more accurate evaluation. Beyond the textual query and passage data, auxiliary resources are also provided to facilitate further studies, such as query types and the XML files of the documents from which the passages are generated. To evaluate the dataset, commonly used ranking models are implemented and tested on T2Ranking as baselines. The experimental results show that T2Ranking is challenging and that there is still room for improvement. The full data and all code are available at https://github.com/THUIR/T2Ranking/
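As an illustration of why 4-level graded relevance can matter for evaluation, the following minimal sketch compares two rankings under graded labels and under binarized labels. It uses NDCG, a common metric for graded judgments; the abstract does not state which metrics the benchmark uses, and the labels, the `binarize` helper, and its threshold are hypothetical values chosen only for this example.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (linear gain, log2 discount)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_labels, k):
    """NDCG@k for relevance labels listed in the order a system ranked the passages."""
    ideal = dcg_at_k(sorted(ranked_labels, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / ideal if ideal > 0 else 0.0


def binarize(labels, threshold=1):
    """Collapse graded labels to 0/1. The threshold is an assumption for illustration."""
    return [1 if label >= threshold else 0 for label in labels]


# Hypothetical 4-level labels (0 = irrelevant ... 3 = highly relevant) for the
# same four passages, ordered by two different systems.
ranking_a = [3, 1, 0, 2]  # puts the best passage first
ranking_b = [1, 3, 0, 2]  # puts a weakly relevant passage first

graded_a = ndcg_at_k(ranking_a, k=4)
graded_b = ndcg_at_k(ranking_b, k=4)
binary_a = ndcg_at_k(binarize(ranking_a), k=4)
binary_b = ndcg_at_k(binarize(ranking_b), k=4)

print(f"graded NDCG@4: A={graded_a:.4f}  B={graded_b:.4f}")
print(f"binary NDCG@4: A={binary_a:.4f}  B={binary_b:.4f}")
```

With graded labels, ranking A scores higher than ranking B because it places the most relevant passage first. Once the labels are binarized, the two rankings become indistinguishable, which is the kind of information coarse-grained judgments discard.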