Practical natural language processing (NLP) tasks commonly involve long-tailed data with noisy labels. These problems challenge the generalization and robustness of complex models such as Deep Neural Networks (DNNs). Commonly used resampling techniques, such as oversampling and undersampling, can easily lead to overfitting. Learning data weights from a small amount of metadata is becoming increasingly popular. In addition, recent studies have shown the advantages of self-supervised pre-training, particularly for under-represented data. In this work, we propose a general framework that handles both long-tailed distributions and noisy labels. The model is adapted to the problem domain through contrastive learning. The re-weighting module is a feed-forward network that learns explicit weighting functions and adapts the weights according to metadata. The framework further adapts the weights of the terms in the loss function by combining the polynomial expansions of the cross-entropy loss and the focal loss. Our extensive experiments show that the proposed framework consistently outperforms baseline methods. Lastly, our sensitivity analysis highlights the capability of the proposed framework to handle the long-tailed problem and to mitigate the negative impact of noisy labels.
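To make the loss described in the abstract more concrete, the following is a minimal sketch of a per-sample weighted loss built from the first-order polynomial ("Poly-1") forms of cross-entropy and focal loss. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact formulation, so the Poly-1 form, the function names (`poly1_focal_loss`, `weighted_batch_loss`), and the hyperparameters `gamma` and `epsilon` are assumptions. In the proposed framework the sample weights are produced by a feed-forward re-weighting module; here they are passed in as plain numbers.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def poly1_focal_loss(logits, target, gamma=2.0, epsilon=1.0):
    """Poly-1 focal loss for one sample (assumed form, not from the paper).

    loss = -(1 - p_t)^gamma * log(p_t) + epsilon * (1 - p_t)^(gamma + 1)

    With gamma = 0 this reduces to Poly-1 cross-entropy,
    and with gamma = 0 and epsilon = 0 to plain cross-entropy.
    """
    # Clamp to avoid log(0) when the target probability underflows.
    p_t = max(softmax(logits)[target], 1e-12)
    focal_term = -((1.0 - p_t) ** gamma) * math.log(p_t)
    poly_term = epsilon * (1.0 - p_t) ** (gamma + 1.0)
    return focal_term + poly_term


def weighted_batch_loss(batch_logits, targets, weights, gamma=2.0, epsilon=1.0):
    """Weighted mean of per-sample losses.

    `weights` stands in for the output of a learned re-weighting module
    (hypothetical here: supplied directly by the caller).
    """
    losses = [
        poly1_focal_loss(logits, target, gamma, epsilon)
        for logits, target in zip(batch_logits, targets)
    ]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


if __name__ == "__main__":
    batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
    targets = [0, 2]
    weights = [1.0, 0.2]  # e.g. a suspected noisy sample gets a small weight
    print(weighted_batch_loss(batch_logits, targets, weights))
```

In this sketch, a small weight reduces a sample's contribution to the batch loss, which is how a re-weighting module could suppress suspected noisy labels, while `gamma` and `epsilon` control how strongly hard, low-confidence examples (often from tail classes) are emphasized.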