We propose a novel task-agnostic in-domain pre-training method that sits between generic pre-training and fine-tuning. Our approach selectively masks in-domain keywords, i.e., words that provide a compact representation of the target domain. We identify such keywords using KeyBERT (Grootendorst, 2020). We evaluate our approach using six different settings: three datasets combined with two distinct pre-trained language models (PLMs). Our results reveal that the fine-tuned PLMs adapted using our in-domain pre-training strategy outperform PLMs that used in-domain pre-training with random masking as well as those that followed the common pre-train-then-fine-tune paradigm. Further, the overhead of identifying in-domain keywords is reasonable, e.g., 7-15% of the pre-training time (for two epochs) for BERT Large (Devlin et al., 2019).
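To make the core idea concrete, the following is a minimal sketch of keyword-based selective masking, not the authors' implementation. It assumes that a list of in-domain keywords is already available (in the paper these come from KeyBERT; here they are hard-coded, hypothetical values), that masking operates on whole words rather than subword tokens, and that the mask token is the BERT-style `[MASK]`. The function name `mask_keywords` is illustrative only.

```python
import re

MASK_TOKEN = "[MASK]"  # BERT-style mask token; assumed here for illustration


def mask_keywords(text, keywords, mask_token=MASK_TOKEN):
    """Replace every whole-word occurrence of a keyword with mask_token.

    Returns the masked text and the list of original words that were masked.
    In a masked-language-modelling setup, those words would be the
    prediction targets.
    """
    keyword_set = {k.lower() for k in keywords}
    pieces = []
    targets = []
    # Split into alternating word and non-word runs so that spacing and
    # punctuation are preserved exactly when the text is reassembled.
    for piece in re.findall(r"\w+|\W+", text):
        if piece.lower() in keyword_set:
            pieces.append(mask_token)
            targets.append(piece)
        else:
            pieces.append(piece)
    return "".join(pieces), targets


if __name__ == "__main__":
    # Hypothetical example: the keywords stand in for the output of a
    # keyword extractor run over the in-domain corpus.
    review = "The battery drains fast, but the camera is superb."
    domain_keywords = ["battery", "camera"]
    masked, targets = mask_keywords(review, domain_keywords)
    print(masked)
    print(targets)
```

By contrast, random masking would choose positions uniformly, regardless of whether a word is informative about the domain. A real pipeline would apply the mask at the subword level of the PLM's tokenizer and would typically cap the fraction of masked tokens; both details are omitted from this sketch.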