Speech disfluencies occur frequently in conversational and spontaneous speech. However, standard Automatic Speech Recognition (ASR) models struggle to recognize them accurately because these models are typically trained on fluent transcripts. Current research focuses mainly on detecting disfluencies within transcripts, overlooking their exact location and duration in the speech signal. Moreover, previous work often requires model fine-tuning and addresses only a limited set of disfluency types. In this work, we present an inference-only approach that augments any ASR model with the ability to detect open-set disfluencies. We first demonstrate that ASR models have difficulty transcribing disfluent speech. We then propose a modification of the Connectionist Temporal Classification (CTC)-based forced alignment algorithm of \cite{kurzinger2020ctc} that predicts word-level timestamps while effectively capturing disfluent speech. Additionally, we develop a model that classifies alignment gaps between timestamps as containing either disfluent speech or silence; it achieves an accuracy of 81.62\% and an F1-score of 80.07\%. We evaluate the full augmentation pipeline of alignment-gap detection and classification on a disfluent dataset. Our results show that the pipeline captures 74.13\% of the words initially missed by the transcription, demonstrating its potential for downstream tasks.