Leveraging medical record information in the era of big data and machine learning comes with the caveat that data must be cleaned and de-identified. Facilitating data sharing and harmonization for multi-center collaborations is particularly difficult when protected health information (PHI) is contained or embedded in image metadata. We propose a novel Python library, called PyLogik, to help alleviate this issue for ultrasound images, which are particularly challenging because PHI is frequently included directly on the images. PyLogik processes image volumes through a series of text detection and extraction, filtering, thresholding, morphological, and contour comparison steps. This methodology de-identifies the images, reduces file sizes, and prepares image volumes for applications in deep learning and data sharing. To evaluate its effectiveness in processing ultrasound data, a random sample of 50 cardiac ultrasounds (echocardiograms) was processed through PyLogik, and the outputs were compared with manual segmentations by an expert user. The Dice coefficient between the two sets of segmentations averaged 0.976. Next, we investigated the degree of information compression achieved by the algorithm. The resulting data were on average approximately 72% smaller after processing by PyLogik. Our results suggest that PyLogik is a viable methodology for data cleaning and de-identification, region-of-interest (ROI) determination, and file compression, which will facilitate efficient storage, use, and dissemination of ultrasound data. Variants of the pipeline have also been created for use with other medical imaging data types.
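The two evaluation metrics reported above can be illustrated with a minimal sketch. This is not PyLogik's API: the function names `dice_coefficient` and `size_reduction`, the toy masks, and the byte counts are all hypothetical, and plain Python lists stand in for the image arrays a real pipeline would use. The Dice coefficient is computed as 2|A ∩ B| / (|A| + |B|) over two binary masks, and the size reduction as the fraction by which the processed file is smaller than the original.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # 2D binary mask: 1 = inside the ROI, 0 = outside


def dice_coefficient(mask_a: Mask, mask_b: Mask) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks of equal size."""
    flat_a = [bool(v) for row in mask_a for v in row]
    flat_b = [bool(v) for row in mask_b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("masks must contain the same number of pixels")
    intersection = sum(a and b for a, b in zip(flat_a, flat_b))
    total = sum(flat_a) + sum(flat_b)
    # By convention, two empty masks are treated as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def size_reduction(original_bytes: int, processed_bytes: int) -> float:
    """Fraction by which the processed file is smaller than the original."""
    if original_bytes <= 0:
        raise ValueError("original size must be positive")
    return 1.0 - processed_bytes / original_bytes


# Hypothetical toy data: an automated ROI mask and a manual reference mask.
automated = [[0, 1, 1, 0],
             [0, 1, 1, 0],
             [0, 1, 1, 0]]
manual = [[0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 0, 1, 0]]

dice = dice_coefficient(automated, manual)   # 2 * 5 / (6 + 5)
reduction = size_reduction(1000, 280)        # illustrative byte counts only
```

In a study like the one described, the per-image Dice values would be averaged across the sample, and the size reduction would be averaged across files in the same way.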