Inaccurately labeled training data, or "label noise", poses a significant threat to the integrity of supervised machine learning models. Label noise degrades performance directly: it teaches the model erroneous mappings between features and labels, which leads to poor generalization and reduced accuracy on correctly labeled validation and test data. Current seismological applications mainly mitigate label noise with large-scale training sets or data augmentation, both of which can be labor-intensive and costly. Here, we introduce Label Noise-Contrastive Robust Learning (LaNCoR), an approach that effectively handles noisy labels in seismic signal processing tasks without requiring large-scale training datasets. LaNCoR aligns the distributions of the input waveform features and the label representations in the feature space to correct mislabeling and reduce its impact on training. We evaluate LaNCoR on P-phase arrival-time picking of real microseismic data using two baseline models and training approaches. Our results indicate that LaNCoR can improve performance by up to 28.8% across the evaluated metrics. This approach holds great promise for model training in seismology and the geosciences.
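To make the idea of aligning waveform features with label representations concrete, the following is a minimal sketch and not the LaNCoR implementation. It assumes an InfoNCE-style contrastive loss over cosine similarities and a simple similarity threshold for flagging suspect labels; the function names (`cosine`, `alignment_loss`, `flag_suspect_labels`), the temperature, the threshold, and the toy two-dimensional embeddings are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def alignment_loss(feat_emb, label_emb, temperature=0.1):
    """InfoNCE-style loss (an assumption, not the paper's stated loss).

    Each waveform embedding feat_emb[i] is pulled toward its own label
    embedding label_emb[i] and pushed away from the other labels in the batch.
    """
    losses = []
    for i, f in enumerate(feat_emb):
        logits = [cosine(f, g) / temperature for g in label_emb]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


def flag_suspect_labels(feat_emb, label_emb, threshold=0.0):
    """Flag pairs whose waveform and label embeddings disagree (hypothetical rule)."""
    return [cosine(f, g) < threshold for f, g in zip(feat_emb, label_emb)]


# Toy embeddings: three waveforms and their label representations.
feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
labels_clean = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
labels_noisy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # third label contradicts its waveform

clean_loss = alignment_loss(feats, labels_clean)
noisy_loss = alignment_loss(feats, labels_noisy)
suspects = flag_suspect_labels(feats, labels_noisy)

print(clean_loss < noisy_loss)  # the mislabeled pair raises the loss
print(suspects)                 # [False, False, True]
```

In this toy setting the mislabeled third sample is the only pair with negative similarity, so it both dominates the loss and is the only one flagged. How LaNCoR actually builds the label representations and applies the correction is not specified in the abstract.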