Recent findings show that pre-trained wav2vec 2.0 models are reliable feature extractors for various speaker characteristics classification tasks. We show that latent representations extracted at different layers of a pre-trained wav2vec 2.0 system can be used as features for binary classification to distinguish between children with Cleft Lip and Palate (CLP) and a healthy control group. The results indicate that classification of CLP versus healthy voices reaches an accuracy of 100%, especially with latent representations from the lower and middle encoder layers. To identify factors that influence classification, we test the classifier on unseen out-of-domain healthy and pathological corpora that vary in age, spoken content, and acoustic conditions. These cross-pathology and cross-healthy tests reveal that the trained classifiers are unreliable when training and out-of-domain test data are mismatched in, e.g., age, spoken content, or acoustic conditions.
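The layer-wise evaluation described in the abstract can be illustrated with a minimal sketch. Several parts of it are assumptions and not taken from the paper: each utterance is represented per layer by the mean over its frames, a nearest-centroid rule stands in for the unspecified binary classifier, and random arrays replace real wav2vec 2.0 hidden states (which could be obtained, for example, from the Hugging Face `Wav2Vec2Model` called with `output_hidden_states=True`). The helper names `pool_layers`, `fit_centroids`, `predict`, and `fake_utterance` are hypothetical. Because the data are synthetic and trivially separable, the resulting accuracies say nothing about real CLP recordings.

```python
import numpy as np

def pool_layers(hidden_states):
    """Mean-pool each layer's (frames, dim) matrix into one (dim,) vector."""
    return np.stack([h.mean(axis=0) for h in hidden_states])  # (layers, dim)

def fit_centroids(X, y):
    """Nearest-centroid stand-in classifier: one mean vector per class."""
    return {int(c): X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(centroids, X):
    """Assign each row of X to the class with the closest centroid."""
    classes = sorted(centroids)
    dists = np.stack(
        [np.linalg.norm(X - centroids[c], axis=1) for c in classes], axis=1
    )
    return np.array(classes)[dists.argmin(axis=1)]

# Synthetic stand-in for per-layer hidden states (ASSUMPTION: not real data).
rng = np.random.default_rng(0)
n_layers, dim = 4, 16

def fake_utterance(label):
    """Return a list of per-layer (frames, dim) arrays; class 1 is offset by 1."""
    frames = int(rng.integers(20, 40))
    return [
        rng.normal(loc=float(label), scale=1.0, size=(frames, dim))
        for _ in range(n_layers)
    ]

labels = np.array([i % 2 for i in range(40)])  # 0 = control, 1 = pathological
feats = np.stack([pool_layers(fake_utterance(l)) for l in labels])  # (40, layers, dim)

train, test = slice(0, 30), slice(30, 40)
layer_acc = []
for layer in range(n_layers):
    # Fit and score one classifier per encoder layer.
    centroids = fit_centroids(feats[train, layer], labels[train])
    pred = predict(centroids, feats[test, layer])
    layer_acc.append(float((pred == labels[test]).mean()))

print(layer_acc)
```

Training one classifier per layer, as done here, is what allows lower, middle, and upper encoder layers to be compared directly. The out-of-domain tests mentioned in the abstract would correspond to scoring the same fitted classifiers on features from a different corpus.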