Leveraging medical record information in the era of big data and machine learning comes with the caveat that data must be cleaned and deidentified. Facilitating data sharing and harmonization for multi-center collaborations is particularly difficult when protected health information (PHI) is contained or embedded in image metadata. We propose a novel Python library, called PyLogik, to help alleviate this issue for ultrasound images, which are particularly challenging because PHI is frequently included directly on the images. PyLogik processes image volumes through a series of text detection/extraction, filtering, thresholding, morphological, and contour comparison steps. This methodology deidentifies the images, reduces file sizes, and prepares image volumes for applications in deep learning and data sharing. To evaluate its effectiveness in identifying regions of interest (ROI), a random sample of 50 cardiac ultrasounds (echocardiograms) was processed through PyLogik, and the outputs were compared with manual segmentations by an expert user. The Dice coefficient between the two approaches averaged 0.976. Next, we investigated the degree of information compression achieved by the algorithm. The resulting data were on average approximately 72% smaller after processing by PyLogik. Our results suggest that PyLogik is a viable methodology for ultrasound data cleaning and deidentification, ROI determination, and file compression, which will facilitate efficient storage, use, and dissemination of ultrasound data.
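For readers unfamiliar with the two evaluation metrics reported here, the following is a minimal sketch of how a Dice coefficient between two binary ROI masks and a percentage size reduction could be computed. It is not PyLogik's API: the function names `dice_coefficient` and `size_reduction_percent`, the toy masks, and the byte counts are all assumptions made for illustration, and the empty-mask convention is one common choice rather than something stated in the abstract.

```python
import numpy as np


def dice_coefficient(mask_a, mask_b):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks of equal shape.

    Hypothetical helper, not part of PyLogik.
    """
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    if a.shape != b.shape:
        raise ValueError("masks must have the same shape")
    total = int(a.sum()) + int(b.sum())
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    overlap = int(np.logical_and(a, b).sum())
    return 2.0 * overlap / total


def size_reduction_percent(original_bytes, processed_bytes):
    """Percentage by which the processed data is smaller than the original.

    Hypothetical helper, not part of PyLogik.
    """
    if original_bytes <= 0:
        raise ValueError("original_bytes must be positive")
    return 100.0 * (1.0 - processed_bytes / original_bytes)


# Toy masks standing in for an automatic ROI and a manual segmentation.
auto_mask = np.zeros((10, 10), dtype=bool)
auto_mask[2:8, 2:8] = True      # 6 x 6 = 36 pixels
manual_mask = np.zeros((10, 10), dtype=bool)
manual_mask[3:8, 2:8] = True    # 5 x 6 = 30 pixels, fully inside auto_mask

dice = dice_coefficient(auto_mask, manual_mask)   # 2 * 30 / (36 + 30)
reduction = size_reduction_percent(1000, 280)     # illustrative byte counts
print(f"Dice: {dice:.3f}, size reduction: {reduction:.1f}%")
```

In a study like the one described, the Dice value would be computed per echocardiogram and then averaged over the sample. A value near 1 indicates that the automatic and manual ROIs overlap almost completely.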