Automatic speech recognition (ASR) has made remarkable progress in recent years, driven largely by the emergence of the LLM-based ASR paradigm. Despite their strong performance on a variety of open-source benchmarks, existing LLM-based ASR systems still suffer from two critical limitations. First, they are prone to hallucination errors, often generating excessively long and repetitive outputs that are not well grounded in the acoustic input. Second, they provide limited support for flexible, fine-grained contextual customization. To address these challenges, we propose Index-ASR, a large-scale LLM-based ASR system designed to improve robustness while supporting customizable hotword recognition. The core idea of Index-ASR is to combine an LLM with large-scale training data enriched with background noise and contextual information. Experimental results show that Index-ASR achieves strong performance on both open-source benchmarks and in-house test sets, highlighting its robustness and practicality for real-world ASR applications.