Adopting advanced deep learning architectures for stuttering detection (SD) is challenging because the available datasets are small. To address this, this work applies speech embeddings extracted from deep learning models pre-trained on large audio datasets for other tasks. In particular, we explore audio representations obtained from the emphasized channel attention, propagation, and aggregation time delay neural network (ECAPA-TDNN) and Wav2Vec2.0 models, trained on the VoxCeleb and LibriSpeech datasets, respectively. After extracting the embeddings, we benchmark several traditional classifiers on the SD tasks, such as K-nearest neighbour (KNN), Gaussian naive Bayes, and a neural network. Compared with standard SD systems trained only on the limited SEP-28k dataset, we obtain relative improvements of 12.08%, 28.71%, and 37.9% in unweighted average recall (UAR) over the baselines. Finally, we show that combining the two embeddings and concatenating multiple layers of Wav2Vec2.0 can further improve the UAR by up to 2.60% and 6.32%, respectively.
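For readers unfamiliar with the metric, the following is a minimal sketch of how UAR and a relative improvement are commonly computed. It is not taken from the paper: the function names, the toy labels, and the relative-improvement formula `(new - baseline) / baseline` are assumptions made for illustration.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally
    regardless of how many samples it has."""
    hits = defaultdict(int)    # correct predictions per true class
    totals = defaultdict(int)  # number of samples per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def relative_improvement(baseline, new):
    """Relative gain of `new` over `baseline`, in percent (assumed definition)."""
    return 100.0 * (new - baseline) / baseline


# Toy, made-up labels (not SEP-28k data): 6 fluent clips and 2 clips with a block.
y_true = ["fluent"] * 6 + ["block"] * 2
y_pred = ["fluent"] * 6 + ["block", "fluent"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.3f}, accuracy = {accuracy:.3f}")
```

In this toy case the recall is 6/6 for the fluent class and 1/2 for the block class, so UAR is 0.75 while plain accuracy is 0.875. That gap is why UAR is often preferred for imbalanced data, where a majority class would otherwise dominate the score.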