Pyclipse, a library for deidentification of free-text clinical notes

Automated deidentification of clinical text data is crucial because manual deidentification is costly, and that cost has been a barrier to sharing clinical text and to advancing clinical natural language processing. However, creating effective automated deidentification tools faces several challenges, including poor reproducibility caused by differences in text processing and evaluation methods, and a lack of consistency across clinical domains and institutions. To address these challenges, we propose the pyclipse framework, a unified and configurable evaluation procedure to streamline the comparison of deidentification algorithms. Pyclipse serves as a single interface for running open-source deidentification algorithms on local clinical data, allowing for context-specific evaluation. To demonstrate the utility of pyclipse, we compare six deidentification algorithms across four public and two private clinical text datasets. We find that algorithm performance consistently falls short of the results reported in the original papers, even when evaluated on the same benchmark dataset. These discrepancies highlight the complexity of accurately assessing and comparing deidentification algorithms, emphasizing the need for a reproducible, adjustable, and extensible framework like pyclipse. Our framework lays the foundation for a unified approach to evaluate and improve deidentification tools, ultimately enhancing patient protection in clinical natural language processing.
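
To illustrate why differences in evaluation method alone can change reported performance, the following minimal sketch scores one hypothetical prediction in two ways. It is not pyclipse's API: the function names, the example sentence, and the character offsets are all assumptions made for illustration.

```python
from typing import List, Tuple

# (start, end) character offsets, end exclusive. Hypothetical representation,
# not taken from pyclipse.
Span = Tuple[int, int]


def exact_span_recall(gold: List[Span], pred: List[Span]) -> float:
    """Fraction of gold spans that some predicted span matches exactly."""
    if not gold:
        return 1.0
    pred_set = set(pred)
    return sum(1 for span in gold if span in pred_set) / len(gold)


def char_recall(gold: List[Span], pred: List[Span]) -> float:
    """Fraction of gold identifier characters covered by any predicted span."""
    gold_chars = {i for start, end in gold for i in range(start, end)}
    pred_chars = {i for start, end in pred for i in range(start, end)}
    if not gold_chars:
        return 1.0
    return len(gold_chars & pred_chars) / len(gold_chars)


# Invented example note; the annotator marked the name and the date.
text = "Seen by Dr. John Smith on 03/04/2019."
gold = [(12, 22), (26, 36)]  # "John Smith", "03/04/2019"

# A hypothetical algorithm also removes the title "Dr." with the name.
pred = [(8, 22), (26, 36)]   # "Dr. John Smith", "03/04/2019"

print(exact_span_recall(gold, pred))  # 0.5
print(char_recall(gold, pred))        # 1.0
```

In this invented case the same output has a recall of 0.5 under exact span matching and 1.0 under character-level matching, even though every identifier character was removed. Gaps of this kind are one plausible reason that reported and reproduced results can disagree.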
