Supervised learning techniques typically assume that training data originates from the target population. In practice, however, dataset shift frequently arises and, if not adequately accounted for, can degrade the performance of the resulting predictors. In this work, we propose a novel and flexible framework called DetectShift that quantifies and tests for multiple dataset shifts, encompassing shifts in the distributions of $(X, Y)$, $X$, $Y$, $X|Y$, and $Y|X$. DetectShift gives practitioners insight into which shifts are present, facilitating the adaptation or retraining of predictors using both source and target data. This is especially valuable when labeled samples in the target domain are limited. The framework uses test statistics of the same nature to quantify the magnitude of the various shifts, making results more interpretable. It is versatile, suitable for both regression and classification tasks, and accommodates diverse data types: tabular, text, and image. Experimental results demonstrate the effectiveness of DetectShift in detecting dataset shifts even in high-dimensional settings.
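To make the idea of quantifying and testing a single shift concrete, the following is a minimal sketch for one of the cases listed above, a shift in the distribution of $Y$ for a classification task. It is not the DetectShift implementation: the choice of a plug-in KL divergence as the statistic, the additive smoothing, the permutation test, and the function names `kl_label_shift` and `permutation_test` are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random
from collections import Counter


def kl_label_shift(source_labels, target_labels, classes, smoothing=1.0):
    """Plug-in estimate of KL(target || source) between class frequencies.

    Additive smoothing (an illustrative choice) keeps every estimated
    probability positive so the logarithm is always defined.
    """
    src_counts = Counter(source_labels)
    tgt_counts = Counter(target_labels)
    k = len(classes)
    kl = 0.0
    for c in classes:
        p_src = (src_counts[c] + smoothing) / (len(source_labels) + smoothing * k)
        p_tgt = (tgt_counts[c] + smoothing) / (len(target_labels) + smoothing * k)
        kl += p_tgt * math.log(p_tgt / p_src)
    return kl


def permutation_test(source_labels, target_labels, n_permutations=500, seed=0):
    """Return (observed statistic, permutation p-value) for a shift in Y.

    Under the null hypothesis of no shift, source and target labels are
    exchangeable, so reshuffling the pooled labels gives draws from the
    null distribution of the statistic.
    """
    rng = random.Random(seed)
    classes = sorted(set(source_labels) | set(target_labels))
    observed = kl_label_shift(source_labels, target_labels, classes)

    pooled = list(source_labels) + list(target_labels)
    n_src = len(source_labels)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        stat = kl_label_shift(pooled[:n_src], pooled[n_src:], classes)
        if stat >= observed:
            exceed += 1
    # The +1 terms count the observed split itself, so the p-value is never 0.
    p_value = (exceed + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical data: class balance moves from 80/20 in source to 50/50 in target.
source = [0] * 80 + [1] * 20
target = [0] * 50 + [1] * 50
statistic, p_value = permutation_test(source, target)
print(f"KL estimate: {statistic:.4f}, p-value: {p_value:.4f}")
```

The same pattern, a divergence estimate as the statistic plus a resampling-based null distribution, could in principle be reused for the other shift types, which is one way to read the claim that statistics "of the same nature" make magnitudes comparable. Shifts involving $X$ would need a density-ratio or classifier-based estimate in place of the simple frequency counts used here.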