AI-Driven Acoustic Voice Biomarker-Based Hierarchical Classification of Benign Laryngeal Voice Disorders from Sustained Vowels

Benign laryngeal voice disorders affect nearly one in five individuals and often manifest as dysphonia, while also serving as non-invasive indicators of broader physiological dysfunction. We introduce a clinically inspired hierarchical machine learning framework for automated classification of eight benign voice disorders alongside healthy controls, using acoustic features extracted from short, sustained vowel phonations. Experiments used 15,132 recordings from 1,261 speakers in the Saarbruecken Voice Database, covering the vowels /a/, /i/, and /u/ at neutral, high, low, and gliding pitches. Mirroring clinical triage workflows, the framework operates in three sequential stages: Stage 1 performs binary screening of pathological versus non-pathological voices by integrating convolutional neural network-derived mel-spectrogram features with 21 interpretable acoustic biomarkers; Stage 2 stratifies voices into Healthy, Functional or Psychogenic, and Structural or Inflammatory groups using a cubic support vector machine; Stage 3 achieves fine-grained classification by incorporating probabilistic outputs from prior stages, improving discrimination of structural and inflammatory disorders relative to functional conditions. The proposed system consistently outperformed flat multi-class classifiers and pre-trained self-supervised models, including Meta's HuBERT and Google's HeAR, whose generic objectives are not optimized for sustained clinical phonation. By combining deep spectral representations with interpretable acoustic features, the framework enhances transparency and clinical alignment. These results highlight the potential of quantitative voice biomarkers as scalable, non-invasive tools for early screening, diagnostic triage, and longitudinal monitoring of vocal health.
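The three-stage flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `HierarchicalCascade`, the callable-based model interface, and the toy stand-in models are all hypothetical, and the abstract does not say whether voices screened as non-pathological are routed onward, so the sketch simply runs every stage. The one detail taken directly from the text is that Stage 3 sees the probabilistic outputs of Stages 1 and 2 alongside the acoustic features.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: a "model" is any callable that maps a feature
# vector to a list of class probabilities.
ProbModel = Callable[[Sequence[float]], List[float]]

STAGE1_LABELS = ["non-pathological", "pathological"]
STAGE2_LABELS = ["Healthy", "Functional or Psychogenic", "Structural or Inflammatory"]


def argmax(probs: Sequence[float]) -> int:
    """Index of the largest probability (the first one wins on ties)."""
    return max(range(len(probs)), key=lambda i: probs[i])


@dataclass
class HierarchicalCascade:
    stage1: ProbModel          # binary screening
    stage2: ProbModel          # three-group stratification
    stage3: ProbModel          # fine-grained classification
    stage3_labels: List[str]   # placeholder names for the fine-grained classes

    def predict(self, features: Sequence[float]) -> Dict[str, str]:
        p1 = self.stage1(features)
        p2 = self.stage2(features)
        # Stage 3 input: original features plus the probabilities from the
        # two earlier stages (the ordering of the concatenation is an assumption).
        augmented = list(features) + list(p1) + list(p2)
        p3 = self.stage3(augmented)
        return {
            "stage1": STAGE1_LABELS[argmax(p1)],
            "stage2": STAGE2_LABELS[argmax(p2)],
            "stage3": self.stage3_labels[argmax(p3)],
        }


# Toy stand-ins with fixed outputs, used only to show the data flow.
toy = HierarchicalCascade(
    stage1=lambda x: [0.2, 0.8],
    stage2=lambda x: [0.1, 0.3, 0.6],
    # Reads the last appended value, i.e. the Stage 2 probability for
    # "Structural or Inflammatory".
    stage3=lambda x: [0.7, 0.3] if x[-1] > 0.5 else [0.3, 0.7],
    stage3_labels=["disorder_A", "disorder_B"],
)
result = toy.predict([0.4, 1.2])
print(result)
```

In a real system the three callables would be trained models, for example a classifier over fused CNN and biomarker features for Stage 1 and a degree-3 polynomial-kernel SVM for Stage 2. The sketch only fixes how information passes between the stages.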
