Disfluency detection has mainly been addressed with a pipeline approach, as post-processing of speech recognition output. In this study, we propose Transformer-based encoder-decoder models that jointly perform speech recognition and disfluency detection and that operate in a streaming manner. Compared to pipeline approaches, the joint models can leverage acoustic information, which makes disfluency detection robust to recognition errors and provides non-verbal clues. Moreover, joint modeling results in low-latency and lightweight inference. We investigate two joint model variants for streaming disfluency detection: a transcript-enriched model and a multi-task model. The transcript-enriched model is trained on text with special tags that mark the start and end of each disfluent part. However, the additional disfluency tags cause problems with latency and with standard language model adaptation. To solve these problems, we propose a multi-task model that has two output layers at the Transformer decoder: one for speech recognition and the other for disfluency detection. The model conditions disfluency detection on the currently recognized token through an additional token-dependency mechanism. We show that the proposed joint models outperform a BERT-based pipeline approach in both accuracy and latency on both Switchboard and the Corpus of Spontaneous Japanese.
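To make the difference between the two variants concrete, the following is a minimal sketch of how their training targets might be built from the same annotated utterance. It is an illustration under assumptions, not the implementation used in this study: the tag tokens `<dis>` and `</dis>`, the function names, and the example utterance are all hypothetical, and the sketch covers only target construction, not the Transformer model or the token-dependency mechanism.

```python
from typing import List, Tuple

# Hypothetical tag tokens; the abstract does not name the actual tags.
START_TAG = "<dis>"
END_TAG = "</dis>"


def transcript_enriched_target(tokens: List[str], disfluent: List[bool]) -> List[str]:
    """Wrap each maximal run of disfluent tokens in start/end tags.

    The result is a single output sequence that is longer than the
    plain transcript, because the tags are extra decoder outputs.
    """
    if len(tokens) != len(disfluent):
        raise ValueError("tokens and disfluent flags must have the same length")
    out: List[str] = []
    inside = False
    for tok, flag in zip(tokens, disfluent):
        if flag and not inside:
            out.append(START_TAG)  # a disfluent span begins here
            inside = True
        elif not flag and inside:
            out.append(END_TAG)  # the disfluent span ended before this token
            inside = False
        out.append(tok)
    if inside:
        out.append(END_TAG)  # close a span that runs to the end of the utterance
    return out


def multi_task_targets(tokens: List[str], disfluent: List[bool]) -> Tuple[List[str], List[int]]:
    """Keep the transcript unchanged and return a parallel 0/1 label sequence.

    This corresponds to two output layers: one target per token for
    recognition and one label per token for disfluency detection.
    """
    if len(tokens) != len(disfluent):
        raise ValueError("tokens and disfluent flags must have the same length")
    return list(tokens), [int(flag) for flag in disfluent]


# Hypothetical utterance: "i want uh" is the disfluent part.
tokens = ["i", "want", "uh", "i", "need", "a", "flight"]
disfluent = [True, True, True, False, False, False, False]

enriched = transcript_enriched_target(tokens, disfluent)
asr_target, dis_labels = multi_task_targets(tokens, disfluent)
```

In this sketch the transcript-enriched target has two more symbols than the transcript, which is one way to see why the tags add decoding steps and make the output vocabulary differ from that of a standard language model. The multi-task targets leave the recognition sequence untouched and carry the disfluency decision in a separate, equally long label sequence.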