An acoustic model trained on a large amount of unlabeled data provides a self-supervised speech representation that is useful for solving downstream tasks, possibly after fine-tuning the model on each task. In this work, we build an acoustic model of Brazilian Portuguese speech using a Transformer neural network. The model was pretrained on more than $800$ hours of Brazilian Portuguese speech, using a combination of pretraining techniques. Using a labeled dataset collected for the detection of respiratory insufficiency in Brazilian Portuguese speakers, we fine-tune the pretrained Transformer on three tasks: respiratory insufficiency detection, gender recognition, and age group classification. We compare the performance of pretrained Transformers on these tasks with that of Transformers trained without pretraining and observe a significant improvement. In particular, respiratory insufficiency detection achieves the best results reported so far, indicating that this kind of acoustic model is a promising tool for speech-as-biomarker approaches. Moreover, gender recognition performance is comparable to that of state-of-the-art models for English.