We present Pre-trained Machine Reader (PMR), a novel method for retrofitting pre-trained masked language models (MLMs) into pre-trained machine reading comprehension (MRC) models without acquiring labeled data. PMR resolves the discrepancy between model pre-training and downstream fine-tuning of existing MLMs. To build PMR, we construct a large volume of general-purpose, high-quality MRC-style training data from Wikipedia hyperlinks and design a Wiki Anchor Extraction task to guide the MRC-style pre-training. Besides being simple, PMR effectively solves extraction tasks such as Extractive Question Answering and Named Entity Recognition, and shows substantial improvements over existing approaches, especially in low-resource scenarios. When applied to sequence classification tasks in the MRC formulation, PMR enables the extraction of high-quality rationales that explain the classification process, thereby making predictions more explainable. PMR also has the potential to serve as a unified model for tackling various extraction and classification tasks in the MRC formulation.
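To make the data construction concrete, the following is a minimal sketch of how MRC-style (query, context, answer) triples might be derived from Wikipedia hyperlinks. The abstract does not specify the exact recipe, so the details here are assumptions: the query is taken to be a short definition of the linked article, the context is the passage containing the anchor, and the answers are the character spans of the anchors that link to that article. The names `Anchor`, `MRCExample`, `anchor_for`, and `build_examples` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Anchor:
    start: int   # character offset where the anchor text begins (inclusive)
    end: int     # character offset where the anchor text ends (exclusive)
    target: str  # title of the article the hyperlink points to


@dataclass
class MRCExample:
    query: str                      # assumed: a definition of the linked article
    context: str                    # passage that contains the anchor(s)
    answers: List[Tuple[int, int]]  # character spans of the anchors in the context


def anchor_for(passage: str, surface: str, target: str) -> Anchor:
    """Hypothetical helper: locate the first occurrence of `surface` in the passage."""
    start = passage.index(surface)
    return Anchor(start, start + len(surface), target)


def build_examples(
    passage: str, anchors: List[Anchor], definitions: Dict[str, str]
) -> List[MRCExample]:
    """Group anchors by linked article and emit one MRC example per article."""
    spans_by_target: Dict[str, List[Tuple[int, int]]] = {}
    for anchor in anchors:
        spans_by_target.setdefault(anchor.target, []).append((anchor.start, anchor.end))

    examples = []
    for target, spans in spans_by_target.items():
        query = definitions.get(target)
        if query is None:
            continue  # no definition available for this article, so skip it
        examples.append(MRCExample(query=query, context=passage, answers=sorted(spans)))
    return examples


# Toy data, invented for illustration only.
passage = "The Louvre is a museum in Paris, the capital of France."
anchors = [
    anchor_for(passage, "Paris", "Paris"),
    anchor_for(passage, "France", "France"),
]
definitions = {"Paris": "Paris is the capital and most populous city of France."}

examples = build_examples(passage, anchors, definitions)
for ex in examples:
    print(ex.query, [ex.context[s:e] for s, e in ex.answers])
```

In this sketch no labels are written by hand: the hyperlinks themselves supply the answer spans, which is what allows this kind of pre-training data to be built without labeled data. A model pre-trained this way would be asked to extract the anchor spans from the context given the query.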