Electrocardiogram (ECG) digitization, the conversion of paper-based or scanned ECG images back into time-series signals, is critical for leveraging decades of legacy clinical data in modern deep learning applications. However, progress has been hindered by the lack of large-scale datasets that provide both ECG images and their corresponding ground truth signals with comprehensive annotations. We introduce PTB-XL-Image-17K, a complete synthetic ECG image dataset comprising 17,271 high-quality 12-lead ECG images generated from the PTB-XL signal database. The dataset provides five complementary data types per sample: (1) realistic ECG images with authentic grid patterns and annotations (50% with visible grid, 50% without), (2) pixel-level segmentation masks, (3) ground truth time-series signals, (4) bounding box annotations in YOLO format for both lead regions and lead name labels, and (5) comprehensive metadata including visual parameters and patient information. We also present an open-source Python framework for customizable dataset generation, with controllable parameters including paper speed (25/50 mm/s), voltage scale (5/10 mm/mV), sampling rate (500 Hz), grid appearance (4 colors), and waveform characteristics. Generation succeeded for 100% of samples, with an average processing time of 1.35 seconds per sample. PTB-XL-Image-17K addresses critical gaps in ECG digitization research by providing the first large-scale resource that supports the complete pipeline of lead detection, waveform segmentation, and signal extraction, with full ground truth for rigorous evaluation. The dataset, generation framework, and documentation are publicly available at https://github.com/naqchoalimehdi/PTB-XL-Image-17K and https://doi.org/10.5281/zenodo.18197519.
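To make the stated rendering parameters concrete, the following minimal sketch shows how a signal sample maps to paper coordinates at a given paper speed and voltage scale, and how a pixel bounding box is written as a YOLO-format label line. It is an illustration under stated assumptions, not the released framework's code: the function names, the class ID, and the image size are hypothetical, and the dataset's actual class mapping is not specified here.

```python
def sample_to_paper_mm(sample_index, amplitude_mv,
                       paper_speed_mm_s=25.0,
                       voltage_scale_mm_mv=10.0,
                       sampling_rate_hz=500.0):
    """Map one signal sample to (x, y) offsets in millimetres on ECG paper.

    Hypothetical helper: x grows with time, y with amplitude above baseline.
    """
    x_mm = sample_index / sampling_rate_hz * paper_speed_mm_s
    y_mm = amplitude_mv * voltage_scale_mm_mv
    return x_mm, y_mm


def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Format a pixel bounding box as a YOLO label line.

    YOLO labels store: class x_center y_center width height,
    with all four box values normalized to [0, 1] by the image size.
    """
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


if __name__ == "__main__":
    # One second of signal (500 samples at 500 Hz) at 25 mm/s spans 25 mm;
    # a 1 mV deflection at 10 mm/mV is 10 mm tall.
    print(sample_to_paper_mm(500, 1.0))
    # Class 0 and the 1000x800 image size are assumptions for illustration.
    print(to_yolo_line(0, 100, 200, 300, 400, 1000, 800))
```

At the alternative settings of 50 mm/s and 5 mm/mV, the same one-second, 1 mV segment would span 50 mm horizontally and 5 mm vertically, which is why both parameters must be recorded in the metadata for signal extraction to be evaluated against ground truth.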