Pre-trained contextual language models such as BERT, GPT, and XLNet perform well on document retrieval tasks. Such models are fine-tuned on query-document or query-passage relevance labels to capture ranking signals. However, documents are longer than passages, so document ranking models are constrained by BERT's input limit of 512 tokens. Researchers have proposed ranking strategies that either truncate documents beyond the token limit or chunk them into units that fit into BERT. In the latter case, the relevance labels are either transferred directly from the original query-document pair or learned through an external model. In this paper, we conduct a detailed study of how design decisions about splitting and label transfer affect retrieval effectiveness and efficiency. We find that direct transfer of relevance labels from documents to passages introduces label noise that strongly affects retrieval effectiveness for large training datasets. We also find that query processing times are adversely affected by fine-grained splitting schemes. As a remedy, we propose a careful passage-level labelling scheme using weak supervision that delivers improved performance (3-14% in terms of nDCG score) over most of the recently proposed models for ad-hoc retrieval while maintaining manageable computational complexity on four diverse document retrieval datasets.
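To make the chunking and direct label transfer setting concrete, the following is a minimal sketch and not the scheme proposed in the paper. It assumes whitespace tokens as a stand-in for BERT's WordPiece tokens, non-overlapping passages, and hypothetical helper names (split_into_passages, transfer_labels) chosen only for illustration.

```python
from typing import List, Tuple


def split_into_passages(document: str, max_tokens: int = 512) -> List[str]:
    """Split a document into consecutive, non-overlapping passages.

    Assumption: whitespace tokens approximate model tokens. A real setup
    would count WordPiece tokens and reserve part of the 512-token budget
    for the query and the special tokens.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = document.split()
    return [
        " ".join(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]


def transfer_labels(
    query: str, document: str, doc_label: int, max_tokens: int = 512
) -> List[Tuple[str, str, int]]:
    """Direct label transfer: every passage inherits the document label.

    This is the source of label noise discussed above: a passage of a
    relevant document is marked relevant even if it never addresses the query.
    """
    return [
        (query, passage, doc_label)
        for passage in split_into_passages(document, max_tokens)
    ]


if __name__ == "__main__":
    doc = "a b c d e f g h i j"
    for example in transfer_labels("toy query", doc, doc_label=1, max_tokens=4):
        print(example)
```

With a smaller max_tokens the same document yields more passages, and each of them must be scored at query time, which is one way to see why fine-grained splitting increases query processing cost.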