Pre-trained contextual language models such as BERT, GPT, and XLNet perform well on document retrieval tasks. Such models are fine-tuned on query-document or query-passage relevance labels to capture ranking signals. However, documents are longer than passages, so document ranking models are constrained by BERT's input limit of 512 tokens. Researchers have proposed ranking strategies that either truncate documents at the token limit or split them into chunks that fit into BERT. In the latter case, the relevance labels for the chunks are either transferred directly from the original query-document pair or learned through an external model. In this paper, we conduct a detailed study of how design decisions about splitting and label transfer affect retrieval effectiveness and efficiency. We find that directly transferring relevance labels from documents to passages introduces label noise that strongly affects retrieval effectiveness for large training datasets. We also find that fine-grained splitting schemes adversely affect query processing times. As a remedy, we propose a careful passage-level labelling scheme based on weak supervision. On four diverse document retrieval datasets, it improves nDCG by 3-14% over most recently proposed models for ad-hoc retrieval while keeping computational complexity manageable.
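To make the splitting and direct label transfer discussed above concrete, the following is a minimal sketch, not the implementation studied in the paper. The function names `split_into_passages` and `transfer_label`, the use of whitespace tokens in place of BERT's subword tokens, and the window and stride values are all assumptions made for illustration.

```python
from typing import List, Tuple


def split_into_passages(text: str, max_tokens: int = 512, stride: int = 256) -> List[str]:
    """Split text into overlapping windows of at most max_tokens tokens.

    Assumption: whitespace tokens stand in for BERT subword tokens, so real
    passages would need a smaller window to stay under the 512-token limit.
    """
    if max_tokens <= 0 or stride <= 0:
        raise ValueError("max_tokens and stride must be positive")
    tokens = text.split()
    passages = []
    start = 0
    while start < len(tokens):
        passages.append(" ".join(tokens[start:start + max_tokens]))
        # Stop once the current window already reaches the end of the document.
        if start + max_tokens >= len(tokens):
            break
        start += stride
    return passages


def transfer_label(query: str, passages: List[str], doc_label: int) -> List[Tuple[str, str, int]]:
    """Direct label transfer: every passage inherits the document-level label.

    This is the noisy baseline: a relevant document may contain passages
    that are not themselves relevant to the query.
    """
    return [(query, passage, doc_label) for passage in passages]
```

A smaller stride yields more passages per document, which increases the number of model forward passes at query time. This is one way to see why fine-grained splitting can slow query processing.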