Masked Language Models (MLMs) have achieved remarkable success in many self-supervised representation learning tasks. MLMs are trained by randomly replacing portions of the input sequence with [MASK] tokens and learning to reconstruct the original content from the remaining context. This paper explores the impact of [MASK] tokens on MLMs. Our analysis shows that masking can introduce the corrupted semantics problem, wherein the corrupted context may convey multiple, ambiguous meanings; this problem is also a key factor limiting the performance of MLMs on downstream tasks. Based on these findings, we propose ExLM, a novel enhanced-context MLM. Our approach expands each [MASK] token in the input context into multiple expanded states and models the dependencies between them. This enhancement increases context capacity and enables the model to capture richer semantic information, effectively mitigating the corrupted semantics problem during pre-training. Experimental results demonstrate that ExLM achieves significant performance improvements on both text modeling and SMILES modeling tasks. Further analysis confirms that ExLM enriches semantic representations through context enhancement and effectively reduces the semantic multimodality commonly observed in MLMs.
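To make the setup concrete, below is a minimal, hypothetical sketch in PyTorch of the two steps the abstract describes: standard MLM corruption, followed by the context-enhancement idea of expanding each [MASK] into several states. The function names (random_mask, expand_masks), the expansion factor k, and all implementation details are illustrative assumptions; the abstract does not specify how ExLM parameterizes the expanded states or models the dependencies between them.

```python
import torch

def random_mask(input_ids: torch.Tensor, mask_id: int, mask_prob: float = 0.15):
    """Standard MLM corruption: replace a random subset of tokens with [MASK]."""
    mask = torch.rand(input_ids.shape) < mask_prob
    labels = input_ids.clone()
    labels[~mask] = -100               # only masked positions contribute to the loss
    corrupted = input_ids.clone()
    corrupted[mask] = mask_id
    return corrupted, labels

def expand_masks(input_ids: torch.Tensor, mask_id: int, k: int = 3) -> torch.Tensor:
    """Hypothetical ExLM-style context enhancement: each [MASK] becomes k
    consecutive expanded states, enlarging the context capacity so the encoder
    can represent several candidate semantics per masked position."""
    out = []
    for tok in input_ids.tolist():
        out.extend([mask_id] * k if tok == mask_id else [tok])
    return torch.tensor(out)

# Toy usage: corrupt a sequence, then expand each [MASK] into k = 3 states.
ids = torch.tensor([101, 7592, 2088, 2003, 2307, 102])
corrupted, labels = random_mask(ids, mask_id=103)
expanded = expand_masks(corrupted, mask_id=103, k=3)
```

After expansion, an encoder would process the longer sequence, and some head would aggregate the k states per masked slot into a single prediction; how ExLM actually does this aggregation and dependency modeling is left to the paper body, not shown here.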