With the development of deep learning and natural language processing techniques, pre-trained language models have been widely used to solve information retrieval (IR) problems. Benefiting from the pre-training and fine-tuning paradigm, these models achieve state-of-the-art performance. Previous work has widely used the plain text of Wikipedia in the pre-training stage. However, the rich structured information in Wikipedia, such as titles, abstracts, the hierarchical (multi-level) heading structure, relationships between articles, references, hyperlink structure, and writing organization, has not been fully explored. In this paper, we devise four pre-training objectives tailored for IR tasks based on the structured knowledge of Wikipedia. Compared to existing pre-training methods, our approach can better capture the semantic knowledge in the training corpus by leveraging the human-edited structured data from Wikipedia. Experimental results on multiple IR benchmark datasets show that our model outperforms existing strong retrieval baselines in both zero-shot and fine-tuning settings. In addition, experimental results in the biomedical and legal domains demonstrate that our approach outperforms previous models in vertical domains, especially in scenarios that require long-text similarity matching.