A unified framework for dataset shift diagnostics

Most supervised learning methods assume that the training data comes from the target population. In practice, however, one often faces dataset shift, which can degrade predictor performance if it is not adequately taken into account. In this work, we propose a novel and flexible framework called DetectShift that quantifies and tests for several types of dataset shift, including shifts in the distributions of $(X, Y)$, $X$, $Y$, $X|Y$, and $Y|X$. DetectShift gives practitioners insight into how their data has changed, allowing them to leverage source and target data to retrain or adapt their predictors. This is particularly valuable when labeled samples from the target domain are scarce. The framework uses test statistics of the same nature to quantify the magnitude of each type of shift, which makes the results more interpretable. Moreover, it applies to both regression and classification tasks, and to different types of data such as tabular, text, and image data. Experimental results demonstrate the effectiveness of DetectShift in detecting dataset shifts even in higher dimensions. Our implementation of DetectShift is available at https://github.com/felipemaiapolo/detectshift.
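To make the idea of "quantifying and testing" a shift concrete, here is a minimal sketch for the simplest case, a shift in the distribution of $Y$ in a classification task. It is not the DetectShift API and should not be read as the authors' method: the choice of a smoothed KL divergence as the statistic, the permutation test, and the names `label_kl` and `permutation_test` are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def label_kl(source_labels, target_labels, smoothing=1.0):
    """Smoothed plug-in estimate of KL(target || source) over class labels.

    Hypothetical helper: add-`smoothing` counts keep every probability positive.
    """
    classes = sorted(set(source_labels) | set(target_labels))
    src, tgt = Counter(source_labels), Counter(target_labels)
    n_src = len(source_labels) + smoothing * len(classes)
    n_tgt = len(target_labels) + smoothing * len(classes)
    kl = 0.0
    for c in classes:
        p = (tgt[c] + smoothing) / n_tgt  # smoothed target frequency
        q = (src[c] + smoothing) / n_src  # smoothed source frequency
        kl += p * math.log(p / q)
    return kl


def permutation_test(source_labels, target_labels, n_permutations=500, seed=0):
    """Return (observed statistic, permutation p-value) for a shift in Y.

    Under the null hypothesis of no shift, source and target labels are
    exchangeable, so reshuffling the pooled labels simulates the null.
    """
    rng = random.Random(seed)
    observed = label_kl(source_labels, target_labels)
    pooled = list(source_labels) + list(target_labels)
    n_src = len(source_labels)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if label_kl(pooled[:n_src], pooled[n_src:]) >= observed:
            exceed += 1
    # The +1 terms count the observed split as one of the permutations.
    p_value = (exceed + 1) / (n_permutations + 1)
    return observed, p_value


if __name__ == "__main__":
    # Toy data: class 1 is more frequent in the target sample.
    source = [0] * 500 + [1] * 500
    target = [0] * 200 + [1] * 800
    stat, p = permutation_test(source, target)
    print(f"estimated KL = {stat:.4f}, p-value = {p:.4f}")
```

The statistic measures the size of the shift, while the p-value indicates whether a shift of that size is plausible under sampling noise alone. Shifts involving $X$ would need a different estimator, which this sketch does not attempt.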

