We introduce romanization encoding for script-heavy languages to optimize multilingual and code-switching Automatic Speech Recognition (ASR) systems. By adopting romanization encoding alongside a balanced concatenated tokenizer within a FastConformer-RNNT framework equipped with a Roman2Char module, we significantly reduce vocabulary and output dimensions, enabling larger training batches and reduced memory consumption. Our method decouples acoustic modeling and language modeling, enhancing the flexibility and adaptability of the system. In our study, applying this method to Mandarin-English ASR resulted in a remarkable 63.51% vocabulary reduction and notable performance gains of 13.72% and 15.03% on SEAME code-switching benchmarks. Ablation studies on Mandarin-Korean and Mandarin-Japanese highlight our method's strong capability to address the complexities of other script-heavy languages, paving the way for more versatile and effective multilingual ASR systems.