Understanding what pre-trained language models (PLMs) learn about language, and how they learn it, is an open challenge in natural language processing. Previous work has focused on identifying whether they capture semantic and syntactic information, and how the data or the pre-training objective affects their performance. However, to the best of our knowledge, no previous work has specifically examined how information loss in input token characters affects the performance of PLMs. In this study, we address this gap by pre-training language models using small subsets of characters from individual tokens. Surprisingly, we find that even when pre-training under extreme settings, i.e. using only one character of each token, models retain a high proportion of the full-token models' performance on standard NLU benchmarks and probing tasks. For instance, a model pre-trained only on the first character of each token retains approximately $90$\% and $77$\% of the full-token model's performance on the SuperGLUE and GLUE tasks, respectively.
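To make the setup concrete, the following is a minimal sketch of the two ideas the abstract describes: reducing each token to a small subset of its characters, and expressing a reduced model's score as a percentage of the full-token model's score. It is an illustration under stated assumptions, not the authors' implementation. The function names (`keep_chars`, `truncate_tokens`, `performance_retention`), the `position` parameter, and the whitespace tokenization are all hypothetical, and the scores in the usage lines are made-up numbers chosen only to show the arithmetic.

```python
from typing import List


def keep_chars(token: str, n: int = 1, position: str = "first") -> str:
    """Keep only n characters of a token (hypothetical helper).

    position="first" keeps the leading n characters,
    position="last" keeps the trailing n characters.
    Tokens shorter than n are returned unchanged.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if position == "first":
        return token[:n]
    if position == "last":
        return token[-n:]
    raise ValueError(f"unknown position: {position!r}")


def truncate_tokens(tokens: List[str], n: int = 1, position: str = "first") -> List[str]:
    """Apply keep_chars to every token in a sequence."""
    return [keep_chars(t, n, position) for t in tokens]


def performance_retention(reduced_score: float, full_score: float) -> float:
    """Score of the reduced-input model as a percentage of the full-token model's score."""
    return 100.0 * reduced_score / full_score


# Assumption: whitespace splitting stands in for the real tokenizer.
tokens = "the cat sat on the mat".split()
print(truncate_tokens(tokens, n=1, position="first"))

# Illustrative scores only, not results from the paper.
print(performance_retention(reduced_score=45.0, full_score=50.0))
```

In this sketch, the single-first-character setting corresponds to `n=1, position="first"`. A retention value such as the approximately $90$\% reported for SuperGLUE would be obtained by passing the two models' benchmark scores to `performance_retention`.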