The automated classification of stuttered speech has significant implications for timely assessment, providing assistance to speech-language pathologists. Despite notable advances in the field, cases in which multiple disfluencies co-occur in speech still require attention. We take a progressive approach to fill this gap by classifying multi-stuttered speech more efficiently. First, we curated a dataset of multi-stuttered disfluencies from audio clips in the open-source SEP-28k dataset. Second, we leveraged Whisper, a state-of-the-art speech recognition model, using its encoder and framing the task as multi-label classification. Third, using a 6-encoder-layer Whisper model and experimenting with various layer-freezing strategies, we identified a computationally efficient configuration. The proposed configuration achieved micro, macro, and weighted F1-scores of 0.88, 0.85, and 0.87, respectively, on an external test dataset, Fluency-Bank. Moreover, through layer freezing, we achieved these results by fine-tuning a single encoder layer, reducing the model's trainable parameters from 20.27 million to 3.29 million. This study reveals the contribution of the last encoder layer to the identification of disfluencies in stuttered speech. The result is a computationally efficient approach, with 83.7% fewer parameters to train, making the proposed method more adaptable to various dialects and languages.
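The layer-freezing setup described above can be sketched as follows. This is a minimal, hypothetical illustration in PyTorch, not the paper's implementation: a generic 6-layer Transformer encoder stands in for the Whisper encoder, and the hidden size, head count, and number of disfluency labels are assumed values chosen for demonstration. The key idea shown is freezing all encoder layers except the last one (plus the multi-label classification head) and counting the resulting trainable parameters.

```python
import torch
import torch.nn as nn

NUM_LAYERS = 6    # abstract: 6 encoder layers
D_MODEL = 512     # assumed hidden size (illustrative)
NUM_LABELS = 5    # assumed number of disfluency types (illustrative)

class MultiLabelDisfluencyClassifier(nn.Module):
    """Toy stand-in for a Whisper-encoder-based multi-label classifier."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=NUM_LAYERS)
        # One logit per disfluency type; trained with BCEWithLogitsLoss,
        # so multiple labels can be active for the same clip.
        self.head = nn.Linear(D_MODEL, NUM_LABELS)

    def forward(self, x):
        h = self.encoder(x)        # (batch, time, D_MODEL)
        pooled = h.mean(dim=1)     # mean-pool over the time axis
        return self.head(pooled)   # raw logits, sigmoid applied in the loss

model = MultiLabelDisfluencyClassifier()

# Freeze everything, then unfreeze only the last encoder layer and the head.
for p in model.parameters():
    p.requires_grad = False
for p in model.encoder.layers[-1].parameters():
    p.requires_grad = True
for p in model.head.parameters():
    p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / total: {total:,}")
```

With real Whisper weights, the same pattern applies: load the pretrained encoder, freeze all but the final layer, and train only that layer together with the classification head, which is what yields the reduction from 20.27 M to 3.29 M trainable parameters reported above.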