Practical natural language processing (NLP) tasks are commonly long-tailed and carry noisy labels. Both problems challenge the generalization and robustness of complex models such as deep neural networks (DNNs). Commonly used resampling techniques, such as oversampling and undersampling, can easily lead to overfitting. Learning data weights from a small amount of metadata is growing in popularity. In addition, recent studies have shown the advantages of self-supervised pre-training, particularly for under-represented data. In this work, we propose a general framework that handles both long-tailed distributions and noisy labels. The model is adapted to the problem domain through contrastive learning. The re-weighting module is a feed-forward network that learns explicit weighting functions and adapts the weights according to metadata. The framework further adapts the weights of terms in the loss function through a combination of the polynomial expansion of cross-entropy loss and focal loss. Our extensive experiments show that the proposed framework consistently outperforms baseline methods. Lastly, our sensitivity analysis highlights the capability of the proposed framework to handle the long-tailed problem and mitigate the negative impact of noisy labels.
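To make the loss-side idea concrete, the sketch below shows one way a polynomial expansion can be combined with focal loss and with per-sample weights. The abstract does not state the exact formulation, so this is an illustration under assumptions rather than the proposed method: it uses the first-order ("Poly-1") correction to focal loss, the names `poly1_focal_loss`, `weighted_batch_loss`, `gamma`, and `epsilon` are hypothetical, and the per-sample weights are passed in as plain numbers standing in for the output of a re-weighting network.

```python
import math


def log_softmax_at(logits, target):
    """Return log p_t, the log-probability of the target class (numerically stable)."""
    m = max(logits)
    log_sum = math.log(sum(math.exp(z - m) for z in logits))
    return logits[target] - m - log_sum


def poly1_focal_loss(logits, target, gamma=2.0, epsilon=1.0):
    """Focal loss plus a first-order polynomial term (assumed Poly-1 form).

    focal   = -(1 - p_t)^gamma * log(p_t)
    poly-1  = focal + epsilon * (1 - p_t)^(gamma + 1)
    With gamma = 0 and epsilon = 0 this reduces to plain cross-entropy.
    """
    log_p_t = log_softmax_at(logits, target)
    p_t = math.exp(log_p_t)
    focal = -((1.0 - p_t) ** gamma) * log_p_t
    return focal + epsilon * (1.0 - p_t) ** (gamma + 1.0)


def weighted_batch_loss(batch_logits, targets, weights, gamma=2.0, epsilon=1.0):
    """Weighted average of per-sample losses.

    `weights` is a hypothetical stand-in for learned per-sample weights.
    """
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    losses = [
        poly1_focal_loss(logits, target, gamma, epsilon)
        for logits, target in zip(batch_logits, targets)
    ]
    return sum(w * loss for w, loss in zip(weights, losses)) / total_weight
```

In this sketch, `gamma` down-weights examples the model already classifies confidently, `epsilon` scales the leading polynomial term, and a small per-sample weight reduces the influence of an example suspected of carrying a noisy label.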