This paper introduces FunASR, an open-source speech recognition toolkit designed to bridge the gap between academic research and industrial applications. FunASR provides models trained on large-scale industrial corpora, together with the means to deploy them in applications. The toolkit's flagship model, Paraformer, is a non-autoregressive end-to-end speech recognition model trained on a manually annotated Mandarin speech recognition dataset containing 60,000 hours of speech. To extend Paraformer, we have added timestamp prediction and hotword customization capabilities to the standard Paraformer backbone. In addition, to facilitate model deployment, we have open-sourced a voice activity detection model based on the Feedforward Sequential Memory Network (FSMN-VAD) and a text post-processing punctuation model based on the controllable time-delay Transformer (CT-Transformer), both trained on industrial corpora. Together, these functional modules provide a solid foundation for building high-precision speech recognition services for long audio. Compared to other models trained on open datasets, Paraformer demonstrates superior performance.