Most supervised learning methods assume that the data used in the training phase comes from the target population. In practice, however, one often faces dataset shift, which can degrade predictor performance if not adequately taken into account. In this work, we propose a novel and flexible framework called DetectShift that enables quantification and testing of various types of dataset shift, including shifts in the distributions of $(X, Y)$, $X$, $Y$, $X|Y$, and $Y|X$. DetectShift gives practitioners insight into how their data have changed, allowing them to leverage source and target data to retrain or adapt their predictors. This is particularly valuable in scenarios where labeled samples from the target domain are scarce. The framework uses test statistics of the same nature to quantify the magnitude of the various shifts, which makes the results directly comparable and easier to interpret. Moreover, it can be applied to both regression and classification tasks, and to different types of data such as tabular, text, and image data. Experimental results demonstrate the effectiveness of DetectShift in detecting dataset shifts even in higher dimensions. Our implementation of DetectShift is available at https://github.com/felipemaiapolo/detectshift.
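To make the idea of quantifying and testing a shift concrete, the following is a minimal sketch for the simplest case, a shift in the distribution of a discrete $Y$. It assumes a plug-in estimate of the Kullback-Leibler divergence as the test statistic and a permutation test for the p-value; both choices are assumptions made for illustration. The function names `kl_divergence` and `permutation_test` and the `smoothing` parameter are hypothetical and are not taken from the DetectShift package.

```python
import math
import random
from collections import Counter


def kl_divergence(source_labels, target_labels, smoothing=1.0):
    """Plug-in estimate of KL(source || target) for discrete labels.

    Additive smoothing (an assumption of this sketch) keeps the estimate
    finite when a class is missing from one of the two samples.
    """
    classes = sorted(set(source_labels) | set(target_labels))
    src_counts = Counter(source_labels)
    tgt_counts = Counter(target_labels)
    n_src = len(source_labels) + smoothing * len(classes)
    n_tgt = len(target_labels) + smoothing * len(classes)
    kl = 0.0
    for c in classes:
        p = (src_counts[c] + smoothing) / n_src
        q = (tgt_counts[c] + smoothing) / n_tgt
        kl += p * math.log(p / q)
    return kl


def permutation_test(source_labels, target_labels, n_permutations=500, seed=0):
    """Return (observed statistic, permutation p-value) for a shift in Y."""
    rng = random.Random(seed)
    observed = kl_divergence(source_labels, target_labels)
    pooled = list(source_labels) + list(target_labels)
    n_src = len(source_labels)
    exceed = 0
    for _ in range(n_permutations):
        # Under the null hypothesis of no shift, the labels are exchangeable
        # between the source and target samples.
        rng.shuffle(pooled)
        stat = kl_divergence(pooled[:n_src], pooled[n_src:])
        if stat >= observed:
            exceed += 1
    p_value = (exceed + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical data: the target sample over-represents class "a".
source = ["a"] * 150 + ["b"] * 150
target = ["a"] * 240 + ["b"] * 60
statistic, p_value = permutation_test(source, target)
print(f"estimated KL: {statistic:.4f}, p-value: {p_value:.4f}")
```

The same pattern, one divergence estimate plus one resampling-based p-value, would extend to the other shift types only with an estimator suited to continuous or high-dimensional variables, which this sketch does not attempt.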