Diffusion-Based Heart Sound Generation: Evaluation with Physiological Signal Metrics, Classifiers, and Expert Listening

Publicly available phonocardiogram (PCG) datasets remain limited in size and pathological diversity, constraining both auscultation training and the generalisation of automated heart-sound classifiers. A class-conditional diffusion model for PCG generation is developed in the log-mel domain, and synthetic fidelity is assessed using three complementary approaches: (i) physiology-inspired plausibility metrics, (ii) downstream label-consistency evaluation, and (iii) expert listening. Experiments use the PhysioNet/Computing in Cardiology Challenge 2016 dataset (3240 recordings) with recording-level splits. After preprocessing and quality control, 16,749 non-overlapping 4 s clips are mapped to a normalised 1 × 128 × 128 log-mel representation to train a conditional 2D U-Net denoiser with classifier-free guidance. Signal-level plausibility is quantified on reconstructed waveforms using three lightweight metrics: an envelope-autocorrelation rhythm score, an amplitude-based explosion score, and the dominant cycle lag. Synthetic clips preserve similar dominant cycle durations but exhibit reduced envelope periodicity and increased transient burstiness relative to real clips. For downstream evaluation, a ResNet-50 classifier achieves 92.24% accuracy on the held-out real test set and 82.8% accuracy on class-balanced synthetic batches, indicating that generated signals retain discriminative structure relevant to normal/abnormal classification. In a pilot expert listening study (60 clips, two clinicians), most synthetic clips are judged as heart-sound-like, while abnormality sensitivity is low for both real and synthetic 4 s excerpts. Overall, the results provide a practical baseline for diffusion-based PCG generation while highlighting remaining challenges in retaining abnormal acoustic cues and reducing reconstruction-induced artefacts.
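
The abstract names three signal-level metrics without defining them. As an illustration only, the following is a minimal NumPy sketch of how such metrics could be computed on a waveform. Everything specific here is an assumption and not the authors' method: the function names, the 50 ms moving-average envelope, the 0.3 to 2.0 s lag search range, and the use of the peak-to-RMS ratio (crest factor) as a stand-in for the amplitude-based explosion score.

```python
import numpy as np

def envelope(x, sr, win_s=0.05):
    """Smoothed amplitude envelope: moving average of |x| (assumed definition)."""
    win = max(1, int(win_s * sr))
    kernel = np.ones(win) / win
    return np.convolve(np.abs(x), kernel, mode="same")

def rhythm_and_cycle_lag(x, sr, min_s=0.3, max_s=2.0):
    """Return (rhythm_score, cycle_lag_seconds) from the envelope autocorrelation.

    rhythm_score is the highest normalised autocorrelation value inside the
    assumed cardiac-cycle lag range; cycle_lag_seconds is the lag where it occurs.
    """
    env = envelope(x, sr)
    env = env - env.mean()
    # Keep non-negative lags only; index 0 corresponds to zero lag.
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]
    if ac[0] <= 0:
        return 0.0, float("nan")  # flat envelope: no periodicity to measure
    ac = ac / ac[0]
    lo = int(min_s * sr)
    hi = min(int(max_s * sr), len(ac) - 1)
    k = lo + int(np.argmax(ac[lo:hi + 1]))
    return float(ac[k]), k / sr

def explosion_score(x, eps=1e-12):
    """Peak-to-RMS ratio (crest factor), used here as a burstiness proxy (assumed)."""
    rms = np.sqrt(np.mean(x ** 2))
    return float(np.max(np.abs(x)) / (rms + eps))

if __name__ == "__main__":
    # Hypothetical test signal: 4 s of short bursts repeating every 0.8 s.
    sr = 1000
    t = np.arange(4 * sr) / sr
    x = np.zeros_like(t)
    for c in 0.1 + 0.8 * np.arange(5):
        x += np.exp(-0.5 * ((t - c) / 0.02) ** 2) * np.cos(2 * np.pi * 40 * (t - c))
    score, lag = rhythm_and_cycle_lag(x, sr)
    print(f"rhythm score={score:.2f}, cycle lag={lag:.3f} s, "
          f"explosion score={explosion_score(x):.2f}")
```

Under these assumed definitions, a lower rhythm score corresponds to reduced envelope periodicity and a higher explosion score to increased transient burstiness, which are the two directions in which the abstract reports that synthetic clips differ from real ones.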
