Respiratory sound analysis is a crucial tool for screening asthma and other pulmonary pathologies, yet traditional auscultation remains subjective and experience-dependent. Our prior research established a CNN baseline using DenseNet201, which demonstrated high sensitivity in classifying respiratory sounds. In this work, we (i) adapt the Audio Spectrogram Transformer (AST) to respiratory sound analysis and (ii) evaluate a multimodal Vision-Language Model (VLM) that integrates spectrograms with structured patient metadata. The AST is initialized from publicly available weights and fine-tuned on a medical dataset containing hundreds of recordings per diagnosis. The VLM experiment uses a compact Moondream-type model that processes spectrogram images alongside a structured text prompt (sex, age, recording site) and outputs a JSON-formatted diagnosis. Results indicate that the AST achieves approximately 97% accuracy, an F1-score of approximately 97%, and a ROC AUC of 0.98 for asthma detection, significantly outperforming both the internal CNN baseline and typical external benchmarks. The VLM reaches 86-87% accuracy, performing on par with the CNN baseline while demonstrating the ability to integrate clinical context into the inference process. These results confirm the effectiveness of self-attention for acoustic screening and highlight the potential of multimodal architectures for holistic diagnostic tools.
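To make the AST transfer-learning setup concrete, the following is a minimal sketch using the Hugging Face `transformers` AST classes. The checkpoint name, label set, and learning rate are illustrative assumptions; the paper's actual dataset, diagnosis labels, and training hyperparameters are not specified here.

```python
import torch
from transformers import ASTFeatureExtractor, ASTForAudioClassification

# Assumption: a public AudioSet checkpoint stands in for the paper's weights.
CHECKPOINT = "MIT/ast-finetuned-audioset-10-10-0.4593"
LABELS = ["asthma", "healthy", "other_pathology"]  # hypothetical label set

feature_extractor = ASTFeatureExtractor.from_pretrained(CHECKPOINT)
model = ASTForAudioClassification.from_pretrained(
    CHECKPOINT,
    num_labels=len(LABELS),
    ignore_mismatched_sizes=True,  # replace the AudioSet head with a fresh one
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def fine_tune_step(waveforms, labels):
    """One gradient step on a batch of 16 kHz mono recordings (numpy arrays)."""
    inputs = feature_extractor(
        waveforms, sampling_rate=16000, return_tensors="pt"
    )
    outputs = model(
        input_values=inputs.input_values,
        labels=torch.tensor(labels),
    )
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```

The feature extractor converts raw audio into the 128-bin log-mel spectrogram patches the transformer expects, so only the classification head needs reshaping for the new diagnosis set.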
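On the VLM side, a sketch of the prompt construction and JSON parsing described in the abstract is shown below. The prompt wording, label vocabulary, and the `vlm.query(...)` call are hypothetical stand-ins; the abstract specifies only that sex, age, and recording site accompany the spectrogram image and that the model returns a JSON-formatted diagnosis.

```python
import json

def build_prompt(sex: str, age: int, site: str) -> str:
    """Structured clinical-context prompt paired with the spectrogram image.
    Field names mirror the metadata listed in the abstract; wording is illustrative."""
    return (
        f"Patient: sex={sex}, age={age}, recording site={site}. "
        "Given the attached respiratory-sound spectrogram, respond with JSON: "
        '{"diagnosis": "<asthma|healthy|other>"}'
    )

def parse_diagnosis(raw: str) -> str:
    """Extract the diagnosis from the model's JSON reply, tolerating extra text."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        return "unparseable"
    try:
        return json.loads(raw[start : end + 1]).get("diagnosis", "unparseable")
    except json.JSONDecodeError:
        return "unparseable"

# Hypothetical call into a Moondream-style VLM; `vlm.query(image, prompt)` is a
# stand-in for whatever inference API the deployed model actually exposes.
# answer = vlm.query(spectrogram_image, build_prompt("F", 34, "posterior chest"))
# diagnosis = parse_diagnosis(answer)
```

Parsing defensively matters here: small VLMs often wrap the requested JSON in extra prose, so the sketch extracts the outermost braces before decoding rather than assuming a clean reply.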