Noisy training data can significantly degrade the performance of language-model-based classifiers, particularly in non-topical classification tasks. In this study we design a methodological framework to assess the impact of denoising. More specifically, we explore a range of denoising strategies for sentence-level difficulty detection, using training data derived from document-level difficulty annotations obtained through noisy crowdsourcing. Beyond monolingual settings, we also address cross-lingual transfer, where a multilingual language model is trained on one language and tested on another. We evaluate several noise-reduction techniques, including Gaussian Mixture Models (GMM), Co-Teaching, Noise Transition Matrices, and Label Smoothing. Our results indicate that while BERT-based models exhibit inherent robustness to noise, incorporating explicit noise detection can further enhance performance. For our smaller dataset, GMM-based noise filtering proves particularly effective, raising the Area Under the Curve (AUC) score from 0.52 to 0.92, or to 0.93 when denoising methods are combined. For our larger dataset, however, the intrinsic regularisation of pre-trained language models provides a strong baseline, with denoising methods yielding only marginal gains (from 0.92 to 0.94, while a combination of two denoising methods yielded no further improvement). Nonetheless, removing noisy sentences (about 20\% of the dataset) helps produce a cleaner corpus with fewer infelicities. As a result, we have released the largest multilingual corpus for sentence difficulty prediction: see https://github.com/Nouran-Khallaf/denoising-difficulty
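The GMM-based filtering mentioned above follows the common "small-loss" intuition: noisy labels tend to produce higher per-sample training losses, so a two-component Gaussian mixture fitted to the loss distribution can separate likely-clean from likely-noisy samples. The sketch below illustrates this idea on synthetic losses; function and variable names are illustrative, not the paper's exact pipeline.

```python
# Minimal sketch of loss-based noise filtering with a 2-component GMM.
# Assumes per-sample losses have already been computed by a classifier;
# names (gmm_filter, threshold) are hypothetical, not from the paper.
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_filter(losses, threshold=0.5):
    """Fit a 2-component GMM to per-sample losses and keep samples whose
    posterior probability of belonging to the low-mean ("clean")
    component exceeds `threshold`."""
    losses = np.asarray(losses, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(losses)
    clean_comp = int(np.argmin(gmm.means_.ravel()))  # low-mean = clean
    p_clean = gmm.predict_proba(losses)[:, clean_comp]
    return p_clean > threshold

# Synthetic demo: 80 low-loss ("clean") and 20 high-loss ("noisy") samples,
# mirroring the ~20% noise rate reported in the abstract.
rng = np.random.default_rng(0)
losses = np.concatenate([rng.normal(0.2, 0.05, 80),
                         rng.normal(2.0, 0.3, 20)])
keep = gmm_filter(losses)
```

The boolean mask `keep` would then select the retained sentences before retraining the classifier on the filtered corpus.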