Existing audio datasets are predominantly tailored to single languages, overlooking the complex linguistic behavior of multilingual communities that engage in code-switching, the practice of mixing two or more languages in daily interaction. This practice is particularly prevalent in multilingual regions such as Hong Kong, China. To bridge this gap, we developed a 34.8-hour dataset of Mixed Cantonese and English (MCE) audio using our Multi-Agent Data Generation Framework (MADGF). Fine-tuning the open-source multilingual Automatic Speech Recognition (ASR) model Whisper on the MCE dataset yields impressive zero-shot performance. Traditional metrics, however, overlook factors that matter in real-world applications, such as latency and code-switching scenarios. We therefore introduce a novel evaluation metric, Fidelity to the Original Audio, Accuracy, and Latency (FAL), which aims to overcome the limitations of traditional metrics used to assess ASR systems.