Shifts in data distribution can substantially harm the performance of clinical AI models. Various methods have therefore been developed to detect such shifts at deployment time. However, the root causes of dataset shift are varied, and the choice of mitigation strategy depends heavily on the precise type of shift encountered at test time. Detecting test-time dataset shift is thus not sufficient: precisely identifying which type of shift has occurred is critical. In this work, we propose the first unsupervised dataset shift identification framework, effectively distinguishing between prevalence shift (caused by a change in the label distribution), covariate shift (caused by a change in input characteristics), and mixed shifts (simultaneous prevalence and covariate shifts). We discuss the importance of self-supervised encoders for detecting subtle covariate shifts and propose a novel shift detector that leverages both self-supervised encoders and task model outputs for improved shift detection. We report promising results for the proposed shift identification framework across three imaging modalities (chest radiography, digital mammography, and retinal fundus images) on five types of real-world dataset shift, using four large publicly available datasets.