Automated deidentification of clinical text is crucial because manual deidentification is costly, and that cost has been a barrier both to sharing clinical text and to advancing clinical natural language processing. However, building effective automated deidentification tools faces several challenges, including poor reproducibility caused by differences in text processing and evaluation methods, and a lack of consistency across clinical domains and institutions. To address these challenges, we propose the pyclipse framework, a unified and configurable evaluation procedure that streamlines the comparison of deidentification algorithms. Pyclipse serves as a single interface for running open-source deidentification algorithms on local clinical data, allowing for context-specific evaluation. To demonstrate the utility of pyclipse, we compare six deidentification algorithms across four public and two private clinical text datasets. We find that algorithm performance consistently falls short of the results reported in the original papers, even when evaluated on the same benchmark dataset. These discrepancies highlight how difficult it is to assess and compare deidentification algorithms accurately, and they emphasize the need for a reproducible, adjustable, and extensible framework like pyclipse. Our framework lays the foundation for a unified approach to evaluating and improving deidentification tools, ultimately enhancing patient protection in clinical natural language processing.