Existing technologies expand BERT from different perspectives, e.g., by designing different pre-training tasks, different semantic granularities, and different model architectures, but few models consider expanding BERT across different text formats. In this paper, we propose the heterogeneous knowledge language model (\textbf{HKLM}), a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text. To capture the corresponding relations among these multi-format knowledge sources, our approach uses the masked language model objective to learn word knowledge, and uses the triple classification objective and the title matching objective to learn entity knowledge and topic knowledge, respectively. To obtain the aforementioned multi-format text, we construct a corpus in the tourism domain and conduct experiments on 5 tourism NLP datasets. The results show that our approach outperforms plain-text pre-training while using only 1/4 of the data. We further pre-train a domain-agnostic HKLM and achieve performance gains on the XNLI dataset.
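A minimal sketch of the joint pre-training objective implied above, assuming the three losses are combined by weighted summation (the weights $\lambda_{1}, \lambda_{2}$, and simple summation as a special case, are assumptions; the abstract does not specify how the objectives are combined):
\begin{equation}
\mathcal{L} = \mathcal{L}_{\mathrm{MLM}} + \lambda_{1}\,\mathcal{L}_{\mathrm{TC}} + \lambda_{2}\,\mathcal{L}_{\mathrm{TM}},
\end{equation}
where $\mathcal{L}_{\mathrm{MLM}}$ is the masked language model loss learning word knowledge from unstructured text, $\mathcal{L}_{\mathrm{TC}}$ is the triple classification loss learning entity knowledge from well-structured triples, and $\mathcal{L}_{\mathrm{TM}}$ is the title matching loss learning topic knowledge from semi-structured text.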