Existing audio datasets are predominantly tailored to single languages, overlooking the complex linguistic behavior of multilingual communities that engage in code-switching, the practice of mixing two or more languages in daily interaction. This practice is particularly prevalent in multilingual regions such as Hong Kong, China. To bridge this gap, we developed a 34.8-hour dataset of Mixed Cantonese and English (MCE) audio using our Multi-Agent Data Generation Framework (MADGF). Fine-tuning the open-source multilingual Automatic Speech Recognition (ASR) model Whisper on the MCE dataset yields impressive zero-shot performance. Traditional metrics, however, overlook factors that matter in real-world applications, such as latency and code-switching scenarios. We therefore introduce a novel evaluation metric, Fidelity to the Original Audio, Accuracy, and Latency (FAL), which aims to overcome the limitations of traditional metrics used to assess ASR systems.