In this report, we introduce the Qwen3-ASR family, which comprises two powerful all-in-one speech recognition models and a novel non-autoregressive speech forced-alignment model. Qwen3-ASR-1.7B and Qwen3-ASR-0.6B are ASR models that support language identification and speech recognition across 52 languages and dialects. Both leverage large-scale speech training data and the strong audio understanding ability of their foundation model, Qwen3-Omni. Because ASR models may differ little in scores on open-source benchmarks yet exhibit significant quality differences in real-world scenarios, we conduct comprehensive internal evaluations in addition to reporting results on the open-source benchmarks. The experiments reveal that the 1.7B version achieves state-of-the-art performance among open-source ASR models and is competitive with the strongest proprietary APIs, while the 0.6B version offers the best accuracy-efficiency trade-off. Qwen3-ASR-0.6B achieves an average time to first token (TTFT) as low as 92 ms and transcribes 2000 seconds of speech in 1 second at a concurrency of 128. Qwen3-ForcedAligner-0.6B is an LLM-based non-autoregressive (NAR) timestamp predictor that aligns text-speech pairs in 11 languages. Timestamp-accuracy experiments show that the proposed model outperforms the three strongest existing forced-alignment models while offering further advantages in efficiency and versatility. To accelerate community research on ASR and audio understanding, we release these models under the Apache 2.0 license.