Noise plagues many numerical datasets: recorded values may fail to match the true underlying values because of faulty sensors, data entry or processing mistakes, or imperfect human estimates. We consider general regression settings with covariates and a potentially corrupted response whose observed values may contain errors. By accounting for various uncertainties, we introduce veracity scores that distinguish genuine errors from natural data fluctuations, conditioned on the covariate information available in the dataset. We propose a simple yet efficient filtering procedure for eliminating potential errors and establish theoretical guarantees for our method. We also contribute a new error detection benchmark involving 5 regression datasets with real-world numerical errors (for which the true values are also known). In this benchmark and in additional simulation studies, our method identifies incorrect values with better precision/recall than other approaches.