Recent advances in speech processing have produced deep learning algorithms with strong potential for real-world applications. One such application is the automated identification of stuttered speech. Researchers have recently used Wav2vec2.0, a speech recognition model, to classify disfluency types in stuttered speech. Although Wav2vec2.0 has shown commendable results, its ability to generalize across all disfluency types is limited. In addition, because its base model uses 12 encoder layers, it is considered resource-intensive. Our study explores the capabilities of Whisper for classifying disfluency types in stuttered speech. We make contributions in three areas: improving the quality of the SEP-28k benchmark dataset, exploring Whisper for classification, and introducing an efficient encoder layer freezing strategy. The optimized Whisper model achieves an average F1-score of 0.81, demonstrating its capability for this task. The study also reveals the significance of deeper encoder layers in identifying disfluency types: the results show that they contribute more than the initial layers. Together, these contributions shift the emphasis towards an efficient solution and open the way for further innovation.
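
The idea of freezing encoder layers can be illustrated with a minimal PyTorch sketch. It does not load Whisper and is not the study's implementation: `ToyEncoderClassifier`, `freeze_first_layers`, `count_trainable`, the layer sizes, and the choice of five output classes are all hypothetical stand-ins. The abstract does not say which layers are frozen, so the sketch assumes the earliest ones, in line with the finding that deeper layers contribute more.

```python
import torch
from torch import nn


class ToyEncoderClassifier(nn.Module):
    """Hypothetical stand-in for a speech encoder with a classification head (not Whisper)."""

    def __init__(self, num_layers=6, d_model=32, num_classes=5):
        super().__init__()
        self.layers = nn.ModuleList(
            [
                nn.TransformerEncoderLayer(
                    d_model=d_model, nhead=4, dim_feedforward=64, batch_first=True
                )
                for _ in range(num_layers)
            ]
        )
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        # Mean-pool over the time axis, then classify.
        return self.head(x.mean(dim=1))


def freeze_first_layers(model, num_frozen):
    """Disable gradient updates for the first `num_frozen` encoder layers (an assumed strategy)."""
    for layer in model.layers[:num_frozen]:
        for param in layer.parameters():
            param.requires_grad = False


def count_trainable(model):
    """Number of parameters that will still receive gradient updates."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


model = ToyEncoderClassifier()
total = count_trainable(model)
freeze_first_layers(model, num_frozen=3)
trainable = count_trainable(model)

# One forward/backward pass on random features shaped (batch, frames, d_model).
logits = model(torch.randn(2, 10, 32))
logits.sum().backward()

print(f"trainable parameters: {trainable} of {total}")
```

After the backward pass, the frozen layers hold no gradients, so an optimizer would update only the deeper layers and the classification head. This is where the savings in training cost come from.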