Masked Language Models (MLMs) have achieved remarkable success in many self-supervised representation learning tasks. MLMs are trained by randomly replacing some tokens in the input sentences with [MASK] tokens and predicting the original tokens based on the remaining context. This paper explores the impact of [MASK] tokens on MLMs. Analytical studies show that masking tokens can introduce the corrupted semantics problem, wherein the corrupted context may convey multiple, ambiguous meanings. This problem is a key factor limiting the performance of MLMs on downstream tasks. Based on these findings, we propose a novel enhanced-context MLM, ExLM. Our approach expands [MASK] tokens in the input context and models the dependencies between these expanded states. This expansion increases context capacity and enables the model to capture richer semantic information, effectively mitigating the corrupted semantics problem during pre-training. Experimental results demonstrate that ExLM achieves significant performance improvements in both text modeling and SMILES modeling tasks. Further analysis confirms that ExLM enhances semantic representations through context enhancement and effectively reduces the multimodality problem commonly observed in MLMs.
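The masking procedure the abstract refers to can be sketched as follows. This is a minimal illustration of the standard BERT-style corruption scheme (select ~15% of positions; of those, 80% become [MASK], 10% a random token, 10% unchanged), not the ExLM variant itself; the function name, token IDs, and 80/10/10 split are illustrative assumptions.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, seed=None):
    """BERT-style corruption: select ~mask_prob of positions as prediction
    targets. Of the selected positions, replace 80% with [MASK], 10% with a
    random token, and keep 10% unchanged. Labels store the original token at
    selected positions and -100 (ignored by the loss) elsewhere.
    Illustrative sketch; not the paper's ExLM expansion scheme."""
    rng = random.Random(seed)
    masked = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok          # model must predict the original token
            r = rng.random()
            if r < 0.8:
                masked[i] = mask_id  # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = rng.randrange(vocab_size)  # 10%: random token
            # else: 10% keep the original token
    return masked, labels

# Example with made-up token IDs (103 standing in for [MASK]):
ids = [101, 2054, 2003, 1996, 3007, 1997, 2605, 102]
masked, labels = mask_tokens(ids, mask_id=103, vocab_size=30522, seed=0)
```

The corrupted-semantics problem described above arises because each [MASK] position is predicted from the same corrupted context, which may be consistent with several different original sentences.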