Noisy training data can significantly degrade the performance of language-model-based classifiers, particularly in non-topical classification tasks. In this study we design a methodological framework to assess the impact of denoising. More specifically, we explore a range of denoising strategies for sentence-level difficulty detection, using training data derived from document-level difficulty annotations obtained through noisy crowdsourcing. Beyond monolingual settings, we also address cross-lingual transfer, where a multilingual language model is trained on one language and tested on another. We evaluate several noise-reduction techniques, including Gaussian Mixture Models (GMM), Co-Teaching, Noise Transition Matrices, and Label Smoothing. Our results indicate that while BERT-based models exhibit inherent robustness to noise, incorporating explicit noise detection can further enhance performance. For our smaller dataset, GMM-based noise filtering proves particularly effective, raising the Area Under the Curve (AUC) score from 0.52 to 0.92, or to 0.93 when denoising methods are combined. For our larger dataset, however, the intrinsic regularisation of pre-trained language models provides a strong baseline, with denoising methods yielding only marginal gains (from 0.92 to 0.94, while a combination of two denoising methods yielded no further improvement). Nonetheless, removing noisy sentences (about 20\% of the dataset) helps produce a cleaner corpus with fewer infelicities. As a result, we have released the largest multilingual corpus for sentence difficulty prediction: see https://github.com/Nouran-Khallaf/denoising-difficulty
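The GMM-based filtering mentioned above follows the common "small-loss" intuition: noisy labels tend to produce higher per-sample training losses, so a two-component Gaussian mixture fitted to the loss distribution can separate likely-clean from likely-noisy samples. The sketch below illustrates this idea on synthetic losses; function and variable names are illustrative, not the paper's exact pipeline.

```python
# Minimal sketch of loss-based noise filtering with a 2-component GMM.
# Assumes per-sample losses have already been computed by a classifier;
# names (gmm_filter, threshold) are hypothetical, not from the paper.
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_filter(losses, threshold=0.5):
    """Fit a 2-component GMM to per-sample losses and keep samples whose
    posterior probability of belonging to the low-mean ("clean")
    component exceeds `threshold`."""
    losses = np.asarray(losses, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(losses)
    clean_comp = int(np.argmin(gmm.means_.ravel()))  # low-mean = clean
    p_clean = gmm.predict_proba(losses)[:, clean_comp]
    return p_clean > threshold

# Synthetic demo: 80 low-loss ("clean") and 20 high-loss ("noisy") samples,
# mirroring the ~20% noise rate reported in the abstract.
rng = np.random.default_rng(0)
losses = np.concatenate([rng.normal(0.2, 0.05, 80),
                         rng.normal(2.0, 0.3, 20)])
keep = gmm_filter(losses)
```

The boolean mask `keep` would then select the retained sentences before retraining the classifier on the filtered corpus.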