Masked Language Modeling (MLM) is widely used to pretrain language models. The standard random masking strategy in MLM biases pre-trained language models (PLMs) toward high-frequency tokens. As a result, rare tokens are poorly represented, which limits PLM performance on downstream tasks. To alleviate this frequency bias, we propose two simple and effective Weighted Sampling strategies for choosing which tokens to mask, one based on token frequency and one based on training loss. We apply these two strategies to BERT and obtain Weighted-Sampled BERT (WSBERT). Experiments on the Semantic Textual Similarity (STS) benchmark show that WSBERT significantly improves sentence embeddings over BERT. Combining WSBERT with calibration methods and prompt learning improves sentence embeddings further. We also fine-tune WSBERT on the GLUE benchmark and show that Weighted Sampling improves the transfer learning capability of the backbone PLM. Finally, we analyze how WSBERT improves token embeddings and provide insights into the effect.
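To make the frequency-based strategy concrete, the following is a minimal sketch, not the exact formulation used for WSBERT. It assumes a hypothetical weight of 1 / sqrt(count + eps) per token, so that rarer tokens are more likely to be masked; the function names `masking_weights` and `weighted_mask`, the `eps` smoothing term, and the 15% mask rate are illustrative assumptions. The loss-based strategy could be sketched the same way by replacing the frequency-derived weight with one derived from each token's training loss.

```python
import math
import random


def masking_weights(tokens, corpus_counts, eps=1.0):
    """Return normalized masking probabilities, one per token position.

    Assumption (not taken from the abstract): the weight is
    1 / sqrt(count + eps), so rarer tokens receive larger weights.
    """
    if not tokens:
        return []
    raw = [1.0 / math.sqrt(corpus_counts.get(t, 0) + eps) for t in tokens]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_mask(tokens, corpus_counts, mask_rate=0.15,
                  mask_token="[MASK]", rng=None):
    """Mask about mask_rate of the positions, favoring rare tokens."""
    if not tokens:
        return [], []
    rng = rng or random.Random()
    n_mask = max(1, round(len(tokens) * mask_rate))
    weights = masking_weights(tokens, corpus_counts)
    # Weighted sampling without replacement (Efraimidis-Spirakis):
    # draw u ~ U(0, 1) per position and keep the largest u ** (1 / w).
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    chosen = {i for _, i in sorted(keys, reverse=True)[:n_mask]}
    masked = [mask_token if i in chosen else t for i, t in enumerate(tokens)]
    return masked, sorted(chosen)
```

With uniform weights this reduces to standard random masking, so the only change relative to the usual MLM pipeline is how positions are drawn.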