Data shift occurs in many real-world applications. While multiple methods attempt to detect shifts, the task of localizing and correcting the features that cause them has not been studied in depth. Feature shifts can occur in many datasets, including multi-sensor data, where some sensors malfunction, and tabular and structured data, such as biomedical, financial, and survey data, where faulty standardization and data processing pipelines can produce erroneous features. In this work, we explore the principles of adversarial learning: information from several discriminators trained to distinguish between two distributions is used both to detect the corrupted features and to fix them, removing the distribution shift between datasets. We show that mainstream supervised classifiers, such as random forests or gradient-boosted trees, combined with simple iterative heuristics, can localize and correct feature shifts, outperforming current statistical and neural network-based techniques. The code is available at https://github.com/AI-sandbox/DataFix.
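To make the discriminator idea concrete, the following is a minimal sketch of shift localization, not the DataFix implementation. It assumes a deliberately simplified setup: instead of a random forest or gradient-boosted trees, it trains one single-threshold classifier (a decision stump) per feature to tell reference samples from query samples, and flags a feature when the held-out accuracy is well above chance. All names (`fit_stump`, `stump_accuracy`, `localize_shifted_features`, `make_data`), the `margin` threshold, and the synthetic Gaussian data are hypothetical choices made for illustration. The correction step is not covered.

```python
import random


def fit_stump(values, labels):
    """Fit a one-threshold classifier on a single feature.

    Returns (threshold, left_label): values <= threshold are predicted as
    left_label, all others as 1 - left_label.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(labels)
    best_correct, best_rule = -1, (pairs[0][0], 0)
    pos_left = 0
    for i, (value, label) in enumerate(pairs):
        pos_left += label
        if i + 1 < n and pairs[i + 1][0] == value:
            continue  # only split between distinct values
        left = i + 1
        # Rule A: predict 0 on the left of the threshold, 1 on the right.
        correct_a = (left - pos_left) + (total_pos - pos_left)
        # Rule B: the mirror image of rule A.
        correct_b = n - correct_a
        if correct_a > best_correct:
            best_correct, best_rule = correct_a, (value, 0)
        if correct_b > best_correct:
            best_correct, best_rule = correct_b, (value, 1)
    return best_rule


def stump_accuracy(rule, values, labels):
    """Fraction of samples the fitted stump classifies correctly."""
    threshold, left_label = rule
    hits = sum(
        (left_label if v <= threshold else 1 - left_label) == y
        for v, y in zip(values, labels)
    )
    return hits / len(values)


def localize_shifted_features(reference, query, margin=0.15, seed=0):
    """Flag features whose held-out discriminator accuracy exceeds 0.5 + margin.

    Label 0 marks reference rows and label 1 marks query rows. The margin is
    an assumed, hand-picked threshold, not a value taken from the paper.
    """
    rng = random.Random(seed)
    rows = [(r, 0) for r in reference] + [(q, 1) for q in query]
    rng.shuffle(rows)
    half = len(rows) // 2
    train, test = rows[:half], rows[half:]
    flagged = []
    for j in range(len(reference[0])):
        rule = fit_stump([x[j] for x, _ in train], [y for _, y in train])
        acc = stump_accuracy(rule, [x[j] for x, _ in test], [y for _, y in test])
        if acc > 0.5 + margin:
            flagged.append(j)
    return flagged


def make_data(n, n_features, shifted, seed):
    """Synthetic Gaussian rows; features listed in `shifted` get a mean of 3."""
    rng = random.Random(seed)
    return [
        [rng.gauss(3.0 if j in shifted else 0.0, 1.0) for j in range(n_features)]
        for _ in range(n)
    ]


reference = make_data(500, 5, shifted=set(), seed=1)
query = make_data(500, 5, shifted={2}, seed=2)
flagged = localize_shifted_features(reference, query)
print("Features flagged as shifted:", flagged)
```

In this toy setup only feature 2 has a shifted mean in the query set, so it is the only feature whose discriminator should beat chance by a wide margin. A per-feature stump can only see shifts in a single feature's marginal distribution; detecting shifts that appear only in the relationships between features would need a multivariate discriminator such as the tree ensembles named above.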