Despite advances in large-scale Automatic Speech Recognition (ASR), disfluent speech remains challenging, as state-of-the-art systems are often optimized to omit disfluencies, leading to information loss and hallucinations. Prior work has focused on verbatim transcription and the integration of disfluency markers, but adapting models on limited datasets can lead to catastrophic forgetting of general-domain knowledge. We address this gap by leveraging continual learning (CL) with explicit disfluency tokens. We first introduce these tokens into a pretrained ASR model to establish stable token mechanisms, and then continue training on additional datasets with varying disfluency distributions. Through a detailed analysis of model dynamics during training, we identify a trade-off between marker learning and ASR performance, and a consistent cross-attention head mechanism shared across CL methods.