Leveraging medical record information in the era of big data and machine learning comes with the caveat that data must be cleaned and deidentified. Facilitating data sharing and harmonization for multi-center collaborations is particularly difficult when protected health information (PHI) is contained or embedded in image metadata. We propose a novel Python library, called PyLogik, to help alleviate this issue for ultrasound images, which are particularly challenging because PHI is frequently included directly on the images. PyLogik processes the image volumes through a series of text detection/extraction, filtering, thresholding, morphological, and contour comparison steps. This methodology deidentifies the images, reduces file sizes, and prepares image volumes for applications in deep learning and data sharing. To evaluate its effectiveness in identifying regions of interest (ROI), a random sample of 50 cardiac ultrasounds (echocardiograms) was processed through PyLogik, and the outputs were compared with manual segmentations by an expert user. The Dice coefficient between the two approaches averaged 0.976. Next, we investigated the degree of information compression achieved by the algorithm. The resulting data were on average approximately 72% smaller after processing by PyLogik. Our results suggest that PyLogik is a viable methodology for ultrasound data cleaning and deidentification, ROI determination, and file compression, which will facilitate efficient storage, use, and dissemination of ultrasound data.
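
The two evaluation measures reported above can be illustrated with a minimal sketch. It is not part of PyLogik's API: the function names `dice_coefficient` and `size_reduction_percent` are hypothetical, masks are assumed to be equally sized 2D grids of 0/1 values, and returning 1.0 for two empty masks is a convention chosen here. The Dice coefficient is the standard 2|A ∩ B| / (|A| + |B|); the size reduction is the relative decrease in file size.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice overlap between two binary masks given as 2D grids of 0/1.

    Hypothetical helper for illustration; not taken from PyLogik.
    """
    # Flatten both masks to lists of booleans.
    a = [bool(v) for row in mask_a for v in row]
    b = [bool(v) for row in mask_b for v in row]
    if len(a) != len(b):
        raise ValueError("masks must contain the same number of pixels")

    intersection = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)
    if total == 0:
        # Assumption: two empty masks are treated as a perfect match.
        return 1.0
    return 2.0 * intersection / total


def size_reduction_percent(original_bytes, processed_bytes):
    """Percentage by which the processed file is smaller than the original."""
    if original_bytes <= 0:
        raise ValueError("original size must be positive")
    return 100.0 * (original_bytes - processed_bytes) / original_bytes


if __name__ == "__main__":
    # Toy masks, invented for this example: automated ROI vs. manual ROI.
    automated = [[0, 1, 1],
                 [0, 1, 1]]
    manual = [[0, 1, 1],
              [0, 0, 1]]
    print(dice_coefficient(automated, manual))
    print(size_reduction_percent(100, 28))
```

In a study like the one described, the Dice value would be computed per image and then averaged over the sample, and the size reduction would be computed from file sizes before and after processing.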