Learning from noisy labels is a challenge that arises in many real-world applications where training data can contain incorrect or corrupted labels. When fine-tuning language models with noisy labels, models can easily overfit the label noise, leading to decreased performance. Most existing methods for learning from noisy labels use static input features for denoising, but these methods are limited by the information they can provide on true label distributions and can result in biased or incorrect predictions. In this work, we propose the Dynamics-Enhanced Generative Model (DyGen), which uses dynamic patterns in the embedding space during the fine-tuning process of language models to improve noisy label predictions. DyGen uses the variational auto-encoding framework to infer the posterior distributions of true labels from noisy labels and training dynamics. Additionally, a co-regularization mechanism is used to minimize the impact of potentially noisy labels and priors. DyGen demonstrates an average accuracy improvement of 3.10% on two synthetic noise datasets and 1.48% on three real-world noise datasets compared to the previous state-of-the-art. Extensive experiments and analyses show the effectiveness of each component in DyGen. Our code is available for reproducibility on GitHub.
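As a rough illustration of the co-regularization idea mentioned above, the sketch below penalizes disagreement among several models' predictions on the same inputs. The abstract does not specify the loss, so the form used here (mean KL divergence from the averaged prediction to each model's prediction) is an assumption, and the names `softmax` and `co_regularization_loss` are hypothetical rather than taken from the DyGen code.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def co_regularization_loss(logits_per_model, eps=1e-12):
    """Agreement penalty across K models (assumed form, not the paper's exact loss).

    logits_per_model: array of shape (K, N, C) holding the logits of
    K models for N examples over C classes.
    Returns the mean over models and examples of KL(mean_probs || probs_k).
    """
    probs = softmax(np.asarray(logits_per_model, dtype=float))  # (K, N, C)
    mean_probs = probs.mean(axis=0, keepdims=True)              # (1, N, C)
    # KL divergence from the averaged prediction to each model's prediction.
    kl = (mean_probs * (np.log(mean_probs + eps) - np.log(probs + eps))).sum(axis=-1)
    return float(kl.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(3, 5, 4))  # 3 models, 5 examples, 4 classes
    print("co-regularization penalty:", co_regularization_loss(logits))
```

In a setup like this, the penalty would typically be added to each model's task loss, so that examples on which the models disagree (often the mislabeled ones) contribute a larger regularization signal.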