Speech disfluencies occur frequently in conversational and spontaneous speech. However, standard Automatic Speech Recognition (ASR) models struggle to recognize them accurately because these models are typically trained on fluent transcripts. Current research focuses mainly on detecting disfluencies within transcripts, overlooking their exact location and duration in the speech signal. Moreover, previous work often requires model fine-tuning and addresses only a limited set of disfluency types. In this work, we present an inference-only approach that augments any ASR model with the ability to detect open-set disfluencies. We first demonstrate that ASR models have difficulty transcribing disfluent speech. We then propose a modification of the Connectionist Temporal Classification (CTC)-based forced alignment algorithm of \cite{kurzinger2020ctc} that predicts word-level timestamps while effectively capturing disfluent speech. Additionally, we develop a model that classifies alignment gaps between timestamps as containing either disfluent speech or silence; it achieves an accuracy of 81.62\% and an F1-score of 80.07\%. We evaluate the full augmentation pipeline of alignment-gap detection and classification on a disfluent dataset. Our results show that the pipeline captures 74.13\% of the words initially missed by the transcription, demonstrating its potential for downstream tasks.