The quality of training data is critical to the performance of machine learning models. In this paper, we propose the Error Sensitivity Profile (ESP), a metric that quantifies how sensitive model performance is to errors in a single feature or in multiple features. ESP allows data-cleaning efforts to be prioritized by the error types and features most likely to affect model performance. To support computing this metric, we develop \dirty, an integrated suite of tools. An extensive experimental study on two widely used datasets with 14 classification models reveals that performance degradation is not always predictable from simple correlations with the target variable.
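The idea of measuring sensitivity to errors in one feature can be illustrated with a minimal sketch. It is not the paper's definition of ESP and does not use the \dirty tools. Everything in it is an assumption made for illustration: the synthetic data, the nearest-centroid classifier, the error model (replacing values with Gaussian noise), the choice to corrupt only the training set and evaluate on clean test data, and all function names.

```python
import random
import statistics


def make_data(n, rng):
    """Synthetic binary task (assumed): feature 0 carries signal, feature 1 is noise."""
    rows, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        x0 = rng.gauss(2.0 if y else -2.0, 1.0)
        x1 = rng.gauss(0.0, 1.0)
        rows.append([x0, x1])
        labels.append(y)
    return rows, labels


def inject_errors(rows, feature, rate, rng):
    """Copy the rows, replacing roughly a fraction `rate` of one feature with noise."""
    dirty = [r[:] for r in rows]
    for r in dirty:
        if rng.random() < rate:
            r[feature] = rng.gauss(0.0, 5.0)  # hypothetical error model
    return dirty


def fit_centroids(rows, labels):
    """Nearest-centroid 'training': one mean vector per class."""
    centroids = {}
    for c in set(labels):
        members = [r for r, y in zip(rows, labels) if y == c]
        centroids[c] = [statistics.fmean(col) for col in zip(*members)]
    return centroids


def predict(centroids, row):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
    )


def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, r) == y for r, y in zip(rows, labels))
    return hits / len(labels)


def sensitivity_profile(train, y_train, test, y_test, feature, rates, seed=0):
    """Accuracy on clean test data after corrupting one training feature at each rate."""
    rng = random.Random(seed)
    profile = []
    for rate in rates:
        dirty = inject_errors(train, feature, rate, rng)
        model = fit_centroids(dirty, y_train)
        profile.append((rate, accuracy(model, test, y_test)))
    return profile


if __name__ == "__main__":
    rng = random.Random(1)
    train, y_train = make_data(400, rng)
    test, y_test = make_data(400, rng)
    for feature in (0, 1):
        print(feature, sensitivity_profile(train, y_train, test, y_test,
                                           feature, [0.0, 0.25, 0.5, 0.75]))
```

Comparing the per-feature profiles gives a rough indication of which feature's errors matter most for this particular model and error type, which is the kind of signal that could guide cleaning priorities.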