Recent advancements in AI have democratized its deployment as a healthcare assistant. While models pretrained on large-scale visual and audio datasets have demonstrably generalized to this task, surprisingly, no studies have explored pretrained speech models, which, being trained on human-originated sounds, would intuitively bear a closer resemblance to lung sounds. This paper explores the efficacy of pretrained speech models for respiratory sound classification. We find that there is a characterization gap between speech and lung sound samples, and that data augmentation is essential to bridge this gap. However, SpecAugment, the most widely used augmentation technique for audio and speech, requires a 2-dimensional spectrogram input and therefore cannot be applied to models pretrained on raw speech waveforms. To address this, we propose RepAugment, an input-agnostic representation-level augmentation technique that not only outperforms SpecAugment but is also applicable to respiratory sound classification with waveform-pretrained models. Experimental results show that our approach outperforms SpecAugment, demonstrating a substantial improvement in the accuracy of minority disease classes of up to 7.14%.
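To make the contrast with SpecAugment concrete: because RepAugment operates on encoder outputs rather than on the input, it does not care whether the pretrained model consumed spectrograms or raw waveforms. Below is a minimal sketch of representation-level augmentation, assuming the encoder output is a (batch, time, dim) tensor; the function name `rep_augment` and the masking/noise hyperparameters are illustrative placeholders, not the paper's exact recipe.

```python
import torch

def rep_augment(reps: torch.Tensor, mask_prob: float = 0.3,
                noise_std: float = 0.1) -> torch.Tensor:
    """Sketch of representation-level augmentation.

    Operates on encoder outputs of shape (batch, time, dim), so it is
    agnostic to whether the pretrained backbone took spectrograms or
    raw waveforms as input. mask_prob and noise_std are illustrative
    values, not tuned hyperparameters from the paper.
    """
    # Randomly zero out a subset of time steps (representation masking).
    keep = (torch.rand(reps.size(0), reps.size(1), 1,
                       device=reps.device) > mask_prob).float()
    out = reps * keep
    # Perturb the surviving representations with small Gaussian noise.
    out = out + noise_std * torch.randn_like(out)
    return out
```

Because the augmentation touches only the representation tensor, it can be dropped between a frozen waveform encoder and the classification head without modifying the pretrained model itself.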