Connectionist Temporal Classification (CTC) models are popular in Automatic Speech Recognition (ASR) for their balance between speed and performance. However, CTC models still struggle in other areas, such as personalization towards custom words. A recent approach explores Contextual Adapters, in which an attention-based biasing model for CTC is used to improve the recognition of custom entities. While this approach works well with enough data, we show that it is not an effective strategy for low-resource languages. In this work, we propose a supervision loss for smoother training of the Contextual Adapters. Further, we explore a multilingual strategy to improve performance with limited training data. Our method achieves a 48% F1 improvement in retrieving unseen custom entities for a low-resource language. Interestingly, as a by-product of training the Contextual Adapters, we also see a 5-11% reduction in the Word Error Rate (WER) of the base CTC model.