Data sharing is crucial for open science and reproducible research, but clinical data can be shared legally only after protected health information has been removed from electronic health records. This process, known as de-identification, is often performed by commercial and open-source systems that rely on machine learning algorithms. While these systems have shown compelling results on average, the variation in their performance across demographic groups has not been thoroughly examined. In this work, we investigate the bias of de-identification systems on names in clinical notes through a large-scale empirical analysis. We create 16 name sets that vary along four demographic dimensions: gender, race, name popularity, and the decade of popularity. We insert these names into 100 manually curated clinical templates and evaluate the performance of nine public and private de-identification methods. We find statistically significant performance gaps along a majority of the demographic dimensions in most methods. We further show that de-identification quality is affected by polysemy in names, gender context, and clinical note characteristics. To mitigate the identified gaps, we propose a simple, method-agnostic solution: fine-tuning de-identification methods with clinical context and diverse names. Overall, the bias in existing methods must be addressed promptly so that downstream stakeholders can build high-quality systems that serve all demographic groups fairly.
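The evaluation setup described above (names inserted into templates, then scored per demographic group) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two templates, the two name sets, the group labels, and the lexicon-based `toy_detector` are hypothetical stand-ins, not the paper's 100 templates, 16 name sets, or nine evaluated methods. The sketch computes token-level recall on names per group and performs no significance testing.

```python
from typing import Callable, Dict, List

# Hypothetical templates; "{name}" marks where a name is inserted.
TEMPLATES: List[str] = [
    "Patient {name} was admitted with chest pain.",
    "{name} reports no known drug allergies.",
]

# Hypothetical name sets keyed by an illustrative group label.
NAME_SETS: Dict[str, List[str]] = {
    "group_a": ["Mary Smith", "John Miller"],
    "group_b": ["April Rose", "Young Kim"],  # several tokens are also common words
}


def toy_detector(note: str) -> List[str]:
    """Stand-in de-identifier: flags tokens found in a small name lexicon."""
    lexicon = {"Mary", "Smith", "John", "Miller", "Kim"}
    tokens = [tok.strip(".,") for tok in note.split()]
    return [tok for tok in tokens if tok in lexicon]


def recall_by_group(
    detector: Callable[[str], List[str]],
    templates: List[str],
    name_sets: Dict[str, List[str]],
) -> Dict[str, float]:
    """Fill every template with every name and compute name-token recall per group."""
    results: Dict[str, float] = {}
    for group, names in name_sets.items():
        found = 0
        total = 0
        for template in templates:
            for name in names:
                note = template.format(name=name)
                predicted = set(detector(note))
                name_tokens = name.split()
                total += len(name_tokens)
                found += sum(tok in predicted for tok in name_tokens)
        results[group] = found / total
    return results


recall = recall_by_group(toy_detector, TEMPLATES, NAME_SETS)
for group, value in recall.items():
    print(f"{group}: recall = {value:.2f}")
```

In this toy setting the second group scores lower because most of its name tokens double as ordinary words missing from the lexicon, which loosely mirrors the polysemy effect the abstract mentions. A real evaluation would substitute an actual de-identification system for `toy_detector` and test whether the per-group gaps are statistically significant.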