Approximately 1.2% of the world's population has impaired voice production. As a result, automatic dysphonic voice detection has attracted considerable academic and clinical interest. However, existing methods for automated voice assessment often fail to generalize beyond their training conditions or to related applications. In this paper, we propose a deep learning framework that generates acoustic feature embeddings that are sensitive to vocal quality and robust across corpora. The model is trained jointly with a contrastive loss and a classification loss, and data warping is applied to the input voice samples to improve robustness. Empirical results show that our method achieves high in-corpus and cross-corpus classification accuracy and produces embeddings that are sensitive to voice quality and robust across corpora. We also compare against three baseline methods on clean data and on three deteriorated variations of the in-corpus and cross-corpus datasets, and the proposed model consistently outperforms the baselines.