Automatic Speech Recognition (ASR) systems predominantly cater to monolingual inputs and struggle with the complexity introduced by mixed-language audio. In this paper, we present a novel Multi-Agent Data Generation Framework (MADGF) to address this challenge. We fine-tune the open-source multilingual ASR model Whisper on our generated Mixed Cantonese and English (MCE) audio dataset. The fine-tuned model achieves a Mix Error Rate (MER) of 14.28%, which is 35.13% lower than that of the original model. Meanwhile, single-language recognition ability is not affected: the model reaches a 12.6% Character Error Rate (CER) on Common Voice zh-HK and a 14.8% Word Error Rate (WER) on Common Voice en. However, these metrics do not encompass all aspects critical to ASR systems. Hence, we propose a novel evaluation metric called Fidelity to the Original Audio, Accuracy, and Latency (FAL).
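The abstract reports MER without spelling out how it is computed. As a rough illustration only, the sketch below follows a common convention for code-switched Chinese-English text: each Chinese character counts as one token, each English word counts as one token, and the error rate is the token-level edit distance divided by the number of reference tokens. This tokenization, the function names (`tokenize_mixed`, `edit_distance`, `mix_error_rate`), and the example sentences are assumptions made for illustration and are not taken from the paper.

```python
import re

# Assumption: one token per CJK ideograph, one token per run of Latin
# letters, digits, or apostrophes. Punctuation and whitespace are dropped.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9']+")


def tokenize_mixed(text: str) -> list[str]:
    """Split mixed Chinese/English text into characters and lowercased words."""
    return [tok.lower() for tok in _TOKEN.findall(text)]


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over token sequences (sub/ins/del each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mix_error_rate(reference: str, hypothesis: str) -> float:
    """Token-level edit distance normalized by the reference token count."""
    ref = tokenize_mixed(reference)
    hyp = tokenize_mixed(hypothesis)
    if not ref:
        raise ValueError("reference contains no tokens")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical utterance: five Cantonese characters plus one English word.
    print(mix_error_rate("我今日要去 meeting", "我今日要去 meet"))
```

Under this convention, CER and WER fall out as the special cases where the text is purely Chinese or purely English, which is one reason a mixed token definition is convenient for code-switched evaluation.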