Accurately detecting dysfluencies in spoken language can improve the performance of automatic speech and language processing components and support the development of more inclusive speech and language technologies. Inspired by the recent trend of deploying large language models (LLMs) as universal learners and processors of non-lexical inputs, such as audio and video, we approach the task of multi-label dysfluency detection as a language modeling problem. We present an LLM with hypothesis candidates generated by an automatic speech recognition system together with acoustic representations extracted from an audio encoder model, and finetune the system to predict dysfluency labels on three datasets containing English and German stuttered speech. The experimental results show that our system effectively combines acoustic and lexical information and achieves competitive results on the multi-label stuttering detection task.