(Predictable) Performance Bias in Unsupervised Anomaly Detection

Background: With the ever-increasing amount of medical imaging data, the demand for algorithms to assist clinicians has grown. Unsupervised anomaly detection (UAD) models promise to aid in the crucial first step of disease detection. While previous studies have thoroughly explored fairness in supervised models in healthcare, fairness in UAD has so far remained unexplored.

Methods: In this study, we evaluated how the subgroup composition of the training dataset translates into disparate performance of UAD models along multiple protected variables on three large-scale publicly available chest X-ray datasets. We validated our experiments using two state-of-the-art UAD models for medical images. Finally, we introduced a novel subgroup-AUROC (sAUROC) metric, which aids in quantifying fairness in machine learning.

Findings: Our experiments revealed empirical "fairness laws" (similar to "scaling laws" for Transformers) for training-dataset composition: linear relationships between anomaly detection performance within a subpopulation and its representation in the training data. Our study further revealed performance disparities even when the training data were balanced, as well as compound effects that exacerbate the drop in performance for subjects who belong to multiple adversely affected groups.

Interpretation: Our study quantified the disparate performance of UAD models to the disadvantage of certain demographic subgroups. Importantly, we showed that this unfairness cannot be mitigated by balanced representation alone. Instead, the representation of some subgroups seems harder for UAD models to learn than that of others. The empirical fairness laws discovered in our study make disparate performance in UAD models easier to estimate and aid in determining the most desirable dataset composition.
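The linear relationship described in the Findings can be illustrated with a minimal sketch. It assumes that one has already measured a per-subgroup performance score (for example an sAUROC value, whose exact definition is not given in this abstract) at several training-set fractions of that subgroup. The function names `fit_linear_law` and `predict_score` and all numbers are hypothetical placeholders for illustration, not the study's code or results.

```python
from typing import Sequence, Tuple


def fit_linear_law(fractions: Sequence[float],
                   scores: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of score = slope * fraction + intercept.

    fractions: share of the subgroup in the training data (0.0 to 1.0).
    scores:    performance measured on that subgroup at each fraction.
    """
    if len(fractions) != len(scores) or len(fractions) < 2:
        raise ValueError("need at least two paired observations")
    n = len(fractions)
    mean_x = sum(fractions) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in fractions)
    if sxx == 0:
        raise ValueError("fractions must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(fractions, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_score(slope: float, intercept: float, fraction: float) -> float:
    """Estimate subgroup performance at an unseen training-set fraction."""
    return slope * fraction + intercept


# Placeholder values for illustration only; they are NOT results from the study.
example_fractions = [0.0, 0.25, 0.5, 0.75, 1.0]
example_scores = [0.60, 0.62, 0.65, 0.67, 0.70]
slope, intercept = fit_linear_law(example_fractions, example_scores)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

Under the assumption that the relationship is indeed linear, such a fit would let one estimate a subgroup's expected performance for a planned dataset composition before training on it.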
