A unified framework for dataset shift diagnostics

Supervised learning techniques typically assume that training data originate from the target population. In practice, however, dataset shift frequently arises and, if not adequately taken into account, may degrade the performance of the resulting predictors. In this work, we propose a novel and flexible framework called DetectShift that quantifies and tests for multiple dataset shifts, encompassing shifts in the distributions of $(X, Y)$, $X$, $Y$, $X|Y$, and $Y|X$. DetectShift gives practitioners insight into which kind of shift is present, which helps them adapt or retrain predictors using both source and target data. This is especially valuable when labeled samples in the target domain are limited. The framework uses test statistics of the same nature to quantify the magnitude of the various shifts, which makes the results easier to interpret and compare. It is versatile, suitable for both regression and classification tasks, and accommodates diverse data forms: tabular, text, or image. Experimental results demonstrate the effectiveness of DetectShift in detecting dataset shifts, even in higher dimensions.
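
To make "quantifying and testing for a shift" concrete, the sketch below tests for a shift in the distribution of $Y$ (label shift) in a classification setting. It is a generic illustration, not the DetectShift implementation: the choice of a plug-in KL divergence between empirical class frequencies as the statistic, the direction of the divergence, the permutation-based p-value, and the function names `label_kl` and `permutation_pvalue` are all assumptions made for this example.

```python
import numpy as np


def label_kl(y_source, y_target, n_classes, eps=1e-9):
    """Plug-in KL(target || source) between empirical class frequencies.

    Assumes labels are integers in {0, ..., n_classes - 1}.
    """
    p_s = np.bincount(y_source, minlength=n_classes) / len(y_source)
    p_t = np.bincount(y_target, minlength=n_classes) / len(y_target)
    # eps guards against log(0) when a class is absent from one sample.
    return float(np.sum(p_t * np.log((p_t + eps) / (p_s + eps))))


def permutation_pvalue(y_source, y_target, n_classes, n_perm=500, seed=0):
    """Return (observed statistic, permutation p-value) for H0: no label shift."""
    rng = np.random.default_rng(seed)
    observed = label_kl(y_source, y_target, n_classes)
    pooled = np.concatenate([y_source, y_target])
    n_s = len(y_source)
    exceed = 0
    for _ in range(n_perm):
        # Under H0 the source/target split is arbitrary, so reshuffle it.
        perm = rng.permutation(pooled)
        if label_kl(perm[:n_s], perm[n_s:], n_classes) >= observed:
            exceed += 1
    # The add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical data: balanced source labels, imbalanced target labels.
    y_src = np.array([0] * 250 + [1] * 250)
    y_tgt = np.array([0] * 450 + [1] * 50)
    stat, p_value = permutation_pvalue(y_src, y_tgt, n_classes=2)
    print(f"KL estimate: {stat:.3f}, p-value: {p_value:.4f}")
```

The statistic gives a magnitude for the shift and the p-value gives a test decision. Using one divergence-type statistic for every kind of shift is what would put the magnitudes on a common scale, in the spirit of the abstract's remark about statistics "of the same nature".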

