With the development of deep learning and natural language processing techniques, pre-trained language models have been widely used to solve information retrieval (IR) problems. Benefiting from the pre-training and fine-tuning paradigm, these models achieve state-of-the-art performance. In previous work, plain text from Wikipedia has been widely used in the pre-training stage. However, the rich structured information in Wikipedia, such as titles, abstracts, the hierarchical heading (multi-level title) structure, relationships between articles, references, hyperlink structures, and writing organization, has not been fully explored. In this paper, we devise four pre-training objectives tailored for IR tasks based on the structured knowledge of Wikipedia. Compared to existing pre-training methods, our approach better captures the semantic knowledge in the training corpus by leveraging the human-edited structured data from Wikipedia. Experimental results on multiple IR benchmark datasets show that our model outperforms existing strong retrieval baselines in both zero-shot and fine-tuning settings. In addition, experimental results in the biomedical and legal domains demonstrate that our approach outperforms previous models in vertical domains, especially in scenarios that require long-text similarity matching.