A Synthetic Electrocardiogram (ECG) Image Generation Toolbox to Facilitate Deep Learning-Based Scanned ECG Digitization

Access to medical data is often limited because it contains protected health information (PHI), and using records that contain personally identifiable information raises privacy concerns. Recent advances have been made in applying deep learning-based algorithms to clinical diagnosis and decision-making. However, deep learning models are data-hungry, whereas the medical datasets available for training and evaluating them are relatively limited. Data augmentation with so-called \textit{digital twins} is an emerging technique to address this need. This paper presents a novel approach for generating synthetic electrocardiogram (ECG) images with realistic artifacts from time-series data, for use in developing algorithms that digitize ECG images. Synthetic data is generated in a privacy-preserving manner: distortionless ECG images are first rendered on a standard ECG paper background, and various distortions, including handwritten text artifacts, wrinkles, creases, and perspective transforms, are then applied to them. The artifacts are generated synthetically, without personally identifiable information. As a use case, we generated a large ECG image dataset of 21,801 records from the PhysioNet PTB-XL dataset, with 12-lead ECG time-series data from 18,869 patients. A deep ECG image digitization model was developed and trained on the synthetic dataset, and was then used to convert the synthetic images back to time-series data for evaluation. The signal-to-noise ratio (SNR) of the recovered signals was calculated against the ground-truth ECG time-series to assess digitization quality. The results show an average signal recovery SNR of 27$\pm$2.8\,dB, demonstrating the value of the proposed synthetic ECG image dataset for training deep learning models.
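The abstract reports digitization quality as an SNR in dB but does not spell out the formula. The sketch below assumes the common definition, in which the recovered signal is compared sample by sample with the ground truth and SNR = 10 log10(signal power / error power). The function name `snr_db` and the toy sine-wave signals are illustrative assumptions, not part of the toolbox or of the paper's evaluation code.

```python
import math


def snr_db(reference, recovered):
    """SNR in dB of `recovered` against `reference` (assumed definition).

    SNR = 10 * log10( sum(x^2) / sum((x - x_hat)^2) ), where x is the
    ground-truth signal and x_hat is the digitized (recovered) signal.
    """
    if len(reference) != len(recovered):
        raise ValueError("signals must have the same length")
    signal_power = sum(x * x for x in reference)
    noise_power = sum((x - y) ** 2 for x, y in zip(reference, recovered))
    if noise_power == 0:
        return math.inf  # perfect recovery
    if signal_power == 0:
        return -math.inf  # all-zero reference with a non-zero error
    return 10.0 * math.log10(signal_power / noise_power)


# Toy example (hypothetical data): ten periods of a sine wave, and a
# "recovered" copy that carries a small constant offset error.
reference = [math.sin(2 * math.pi * k / 50) for k in range(500)]
recovered = [x + 0.01 for x in reference]
print(f"SNR = {snr_db(reference, recovered):.2f} dB")
```

In a 12-lead setting, one would presumably apply such a function per lead and per record and then report the mean and standard deviation; the abstract does not specify how its average was aggregated.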
