Privacy is a human right that sustains patient-provider trust. Clinical notes record patients' vulnerabilities and individuality in private detail, and they are shared for care coordination and research. Under the HIPAA Safe Harbor standard, these notes are de-identified to protect patient privacy. However, Safe Harbor was designed for an era of categorical tabular data: it focuses on removing explicit identifiers and ignores the latent information carried by correlations between identity and quasi-identifiers, which modern LLMs can capture. We first formalize these correlations with a causal graph, then validate the graph empirically by re-identifying individual patients from scrubbed notes. A diagnosis ablation further exposes the paradox of de-identification: even when all other information is removed, the model can predict a patient's neighborhood from the diagnosis alone. This position paper asks how we, as a community, can uphold patient-provider trust when de-identification is inherently imperfect. We aim to raise awareness and to discuss actionable recommendations.