Data sharing is crucial for open science and reproducible research, but the legal sharing of clinical data requires the removal of protected health information from electronic health records. This process, known as de-identification, is commonly performed by commercial and open-source systems that rely on machine learning algorithms. While these systems have shown compelling results on average, the variation in their performance across demographic groups has not been thoroughly examined. In this work, we investigate the bias of de-identification systems on names in clinical notes via a large-scale empirical analysis. To achieve this, we create 16 name sets that vary along four demographic dimensions: gender, race, name popularity, and the decade of popularity. We insert these names into 100 manually curated clinical templates and evaluate the performance of nine public and private de-identification methods. Our findings reveal statistically significant performance gaps along a majority of the demographic dimensions in most methods. We further illustrate that de-identification quality is affected by polysemy in names, gender context, and clinical note characteristics. To mitigate the identified gaps, we propose a simple and method-agnostic solution: fine-tuning de-identification methods with clinical context and diverse names. Overall, it is imperative to address the bias in existing methods immediately so that downstream stakeholders can build high-quality systems that serve all demographic groups fairly.
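The evaluation protocol sketched in the abstract (inserting names from demographic-specific name sets into clinical templates, then measuring how often each method removes them) can be illustrated with a minimal, self-contained example. Everything here is hypothetical: the templates, the name sets, and the toy de-identifier are stand-ins for the 100 curated templates, 16 name sets, and nine methods studied in the paper.

```python
import re

# Illustrative clinical templates with a [NAME] placeholder
# (the study uses 100 manually curated templates).
templates = [
    "Patient [NAME] presented with chest pain.",
    "Discussed discharge plan with [NAME] and family.",
]

# Illustrative name sets keyed by demographic group (not real study data).
name_sets = {
    "group_a": ["Emily", "Anne"],
    "group_b": ["Lakisha", "Tremayne"],
}

def toy_deidentifier(note):
    """Stand-in de-identifier: masks only names on a fixed list,
    mimicking a system whose coverage differs across name sets."""
    known = {"Emily", "Anne", "Lakisha"}
    return re.sub(r"\b(" + "|".join(known) + r")\b", "[MASKED]", note)

def recall_per_group(templates, name_sets, deidentify):
    """Fraction of inserted names the de-identifier removed, per group."""
    results = {}
    for group, names in name_sets.items():
        hits, total = 0, 0
        for name in names:
            for template in templates:
                note = template.replace("[NAME]", name)
                total += 1
                if name not in deidentify(note):
                    hits += 1
        results[group] = hits / total
    return results

print(recall_per_group(templates, name_sets, toy_deidentifier))
# e.g. {'group_a': 1.0, 'group_b': 0.5} — a per-group recall gap
```

A gap like the one above (perfect recall for one group, partial recall for another) is the kind of disparity the paper quantifies and then tests for statistical significance across gender, race, popularity, and decade.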