This study examines the role of uncertainty estimation (UE) methods in multilingual text classification under noisy and non-topical conditions. Using a complex-vs-simple sentence classification task across several languages, we evaluate a range of UE techniques against multiple metrics to assess their contribution to producing more robust predictions. Results indicate that while methods relying on softmax outputs remain competitive in high-resource in-domain settings, their reliability declines in low-resource or domain-shift scenarios. In contrast, Monte Carlo dropout approaches demonstrate consistently strong performance across all languages, offering better calibration, more stable decision thresholds, and greater discriminative power even under adverse conditions. We further demonstrate the positive impact of UE on non-topical classification: abstaining from predicting the 10\% most uncertain instances increases the macro F1 score from 0.81 to 0.85 on the Readme task. By integrating UE with trustworthiness metrics, this study provides actionable insights for developing more reliable NLP systems in real-world multilingual environments. Code is available at https://github.com/Nouran-Khallaf/To-Predict-or-Not-to-Predict
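The abstention procedure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it simulates Monte Carlo dropout with T stochastic softmax samples on toy binary data, scores each instance by the predictive entropy of the mean class distribution, abstains on the 10\% most uncertain instances, and compares macro F1 at full and 90\% coverage. All data, signal strengths, and function names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def predictive_entropy(mc_probs):
    """mc_probs: (T, N, C) softmax outputs from T dropout forward passes.
    Returns the entropy of the mean predictive distribution per instance."""
    mean = mc_probs.mean(axis=0)                       # (N, C)
    return -(mean * np.log(mean + 1e-12)).sum(axis=1)  # (N,)

def keep_mask(entropy, abstain_frac=0.10):
    """True for instances we predict on; abstain on the top `abstain_frac`
    entropies (the most uncertain instances)."""
    cutoff = np.quantile(entropy, 1.0 - abstain_frac)
    return entropy <= cutoff

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return float(np.mean(f1s))

# Toy complex-vs-simple data: N instances, C=2 classes, T dropout passes.
N, C, T = 1000, 2, 10
y_true = rng.integers(0, C, size=N)
# Noisy but informative logits: signal on the true class plus per-pass noise,
# standing in for the variability that dropout induces at inference time.
logits = rng.normal(size=(T, N, C)) + 1.5 * np.eye(C)[y_true]
mc_probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

entropy = predictive_entropy(mc_probs)
keep = keep_mask(entropy, abstain_frac=0.10)
y_pred = mc_probs.mean(axis=0).argmax(axis=1)

full = macro_f1(y_true, y_pred, range(C))
selective = macro_f1(y_true[keep], y_pred[keep], range(C))
print(f"macro F1, full coverage: {full:.3f}")
print(f"macro F1, 90% coverage:  {selective:.3f}")
```

On data like this, the highest-entropy instances tend to be the borderline ones where the averaged prediction is wrong most often, which is why dropping them can raise macro F1 on the remainder, mirroring the 0.81 to 0.85 gain reported above.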