Recent work has linked the susceptibility of pretrained language models (PLMs) to noise to their subword segmentation. However, it remains unclear which aspects of segmentation affect their understanding. This study assesses the robustness of PLMs to the different kinds of segmentation disruption that noise causes. It proposes the Contrastive Lexical Semantic (CoLeS) probe, an evaluation framework for subword segmentation. The framework provides a systematic categorization of segmentation corruption under noise, together with evaluation protocols based on generated contrastive datasets of canonical-noisy word pairs. Experimental results indicate that PLMs cannot accurately compute word meanings when noise introduces completely different subwords, small subword fragments, or a large number of additional subwords, particularly when these are inserted within other subwords.
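To make the idea of a canonical-noisy word pair concrete, the following minimal sketch segments both words and compares the resulting subwords. It is an illustration under stated assumptions, not the CoLeS implementation: the toy vocabulary, the greedy longest-match segmenter (in the style of WordPiece), and the names `segment`, `compare_pair`, and the reported fields are all hypothetical and do not reproduce the paper's categories.

```python
import string
from typing import Dict, List, Set

# Hypothetical toy vocabulary. "##" marks a word-internal piece
# (a WordPiece-style convention, assumed here for illustration).
# Single letters are included so every lowercase word can be segmented.
VOCAB: Set[str] = (
    {"token", "##ization", "tok", "##en"}
    | set(string.ascii_lowercase)
    | {"##" + c for c in string.ascii_lowercase}
)


def segment(word: str, vocab: Set[str] = VOCAB) -> List[str]:
    """Greedy longest-match-first segmentation of a single word."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matches: the whole word is treated as unknown.
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


def _surface(piece: str) -> str:
    """Strip the word-internal marker to get the raw characters."""
    return piece[2:] if piece.startswith("##") else piece


def compare_pair(canonical: str, noisy: str) -> Dict[str, object]:
    """Describe how noise changed the segmentation of a word (illustrative only)."""
    canon = segment(canonical)
    noise = segment(noisy)
    shared = set(canon) & set(noise)
    return {
        "canonical": canon,
        "noisy": noise,
        "no_overlap": not shared,                 # completely different subwords
        "extra_pieces": len(noise) - len(canon),  # additional subwords
        "shortest_noisy_piece": min(len(_surface(p)) for p in noise),
    }


if __name__ == "__main__":
    print(compare_pair("token", "tokne"))
    print(compare_pair("tokenization", "tokenizaton"))
```

In this toy setting, a transposition in "token" leaves no subword in common with the canonical segmentation, while a dropped letter in "tokenization" keeps the first subword but breaks the remainder into single-character fragments. These loosely correspond to the failure conditions the abstract names: completely different subwords, small fragments, and many additional subwords.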