Stuttering is a neurodevelopmental speech impairment characterized by uncontrolled utterances (interjections) and core behaviors (blocks, repetitions, and prolongations), and is caused by a failure of speech sensorimotor function. Because of its complex nature, stuttering detection (SD) is a difficult task. Early detection could help speech therapists observe and correct the speech patterns of persons who stutter (PWS). Stuttered speech from PWS is usually available only in limited amounts and is highly imbalanced across classes. We address the class imbalance problem in the SD domain with a multibranching (MB) scheme and by weighting the contribution of each class in the overall loss function, which yields a substantial improvement on the stuttering classes of the SEP-28k dataset over the baseline (StutterNet). To tackle data scarcity, we investigate the effectiveness of data augmentation on top of the multi-branched training scheme. Augmented training outperforms the MB StutterNet trained on clean data by a relative margin of 4.18% in macro F1-score (F1). In addition, we propose a multi-contextual (MC) StutterNet, which exploits different contexts of the stuttered speech and yields an overall improvement of 4.48% in F1 over the single-context MB StutterNet. Finally, we show that applying data augmentation in the cross-corpora scenario improves overall SD performance by a relative margin of 13.23% in F1 over clean training.
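As an illustration of two ideas mentioned above, class weighting in the loss and relative gains in macro F1, the following is a minimal NumPy sketch. It is not the StutterNet implementation: the inverse-frequency weighting rule and all function names (`inverse_frequency_weights`, `weighted_cross_entropy`, `macro_f1`, `relative_improvement`) are assumptions chosen for illustration, and the multibranching scheme itself is not reproduced here.

```python
import numpy as np


def inverse_frequency_weights(labels, num_classes):
    """Assumed weighting rule: w_c = N / (K * count_c), so rare classes weigh more."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    counts[counts == 0] = 1.0  # avoid division by zero for absent classes
    return len(labels) / (num_classes * counts)


def weighted_cross_entropy(logits, labels, class_weights):
    """Cross-entropy where each sample is scaled by the weight of its true class."""
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]
    w = class_weights[labels]
    # Weighted mean, normalized by the total weight in the batch.
    return float((w * nll).sum() / w.sum())


def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1, so every class counts equally."""
    scores = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline
```

Macro F1 averages per-class scores without regard to class frequency, so a model that ignores a rare stuttering class is penalized even if its overall accuracy is high. A relative margin is measured against the baseline score rather than as an absolute difference in F1 points.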