High-quality training data is essential for the success of machine learning models. However, real-world datasets often contain a mixture of error types arising from systematic flaws in data preparation pipelines, including label errors, feature errors, and spurious correlations. Effective debugging of training data requires both detecting erroneous samples and identifying their specific error types to enable targeted repair, yet existing data cleaning and attribution methods fail to adequately address this dual requirement. In this paper, we propose DeMix, a novel framework that simultaneously diagnoses erroneous samples and their error types. Our key insight is that different error types produce distinct patterns in model behavior. DeMix captures these error-specific patterns with influence vectors, which characterize how each training sample affects model predictions across all validation samples. We formulate training data debugging as a multi-label classification problem in which a classifier predicts error types directly from influence vectors. We further introduce an intervention-based learning strategy that guides the classifier to capture invariant rationales specific to each error type, so that the learned classifier generalizes effectively. Empirical evaluations on 11 tasks across tabular data prediction, recommendation systems, and LLM alignment demonstrate that DeMix significantly outperforms state-of-the-art approaches, achieving a 22.61% improvement in data debugging F1-score and a 9.32% gain in task model performance after data repair. Code is available at: https://github.com/SJTU-DMTai/DeMix.
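To make the notion of an influence vector concrete, the following is a minimal sketch rather than the DeMix implementation. It assumes a logistic regression task model and uses a first-order gradient dot product as a stand-in influence estimator; the abstract does not specify which estimator is used. All function names and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_sample_grads(w, X, y):
    """Gradient of the logistic loss w.r.t. w for each sample, shape (n, d)."""
    return (sigmoid(X @ w) - y)[:, None] * X

def influence_vectors(w, X_train, y_train, X_val, y_val):
    """Row i approximates how training sample i affects each validation loss.

    Assumption: a first-order gradient dot product stands in for the
    influence estimator; the source does not say which one is used.
    """
    g_train = per_sample_grads(w, X_train, y_train)
    g_val = per_sample_grads(w, X_val, y_val)
    return g_train @ g_val.T  # shape (n_train, n_val)

# Toy data (hypothetical): a linear concept with injected label errors.
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
X_train = rng.normal(size=(200, 5))
y_train = (X_train @ w_true > 0).astype(float)
X_val = rng.normal(size=(50, 5))
y_val = (X_val @ w_true > 0).astype(float)
y_train[:20] = 1.0 - y_train[:20]  # flip the first 20 training labels

# Fit the task model with plain gradient descent.
w = np.zeros(5)
for _ in range(500):
    w -= 0.1 * per_sample_grads(w, X_train, y_train).mean(axis=0)

# One influence vector per training sample, one entry per validation sample.
V = influence_vectors(w, X_train, y_train, X_val, y_val)
```

Each row of `V` would then serve as the feature vector for a multi-label classifier with one binary output per error type (label error, feature error, spurious correlation). In this toy setting, rows belonging to flipped labels would be expected to differ systematically from clean rows, which is the kind of error-specific pattern the framework relies on.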