RoDia: A New Dataset for Romanian Dialect Identification from Speech

Dialect identification is a critical task in speech processing and language technology, benefiting applications such as speech recognition and speaker verification. While most research has been dedicated to dialect identification in widely spoken languages, limited attention has been given to low-resource languages such as Romanian. To address this research gap, we introduce RoDia, the first dataset for Romanian dialect identification from speech. The RoDia dataset comprises a varied collection of speech samples from five distinct regions of Romania, covering both urban and rural environments and totaling 2 hours of manually annotated speech data. Along with our dataset, we introduce a set of competitive models to serve as baselines for future research. The top-scoring model achieves a macro F1 score of 59.83% and a micro F1 score of 62.08%, indicating that the task is challenging. We thus believe that RoDia is a valuable resource that will stimulate research on Romanian dialect identification. We publicly release our dataset and code at https://github.com/codrut2/RoDia.
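The abstract reports both a macro and a micro F1 score. As a rough illustration of how the two averages differ on a five-class, single-label task, here is a minimal sketch in plain Python. The region labels "A" to "E" and the toy predictions are invented for this example; they are not taken from RoDia, and the function is not the authors' evaluation code.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # predicted class p wrongly
            fn[t] += 1          # missed an instance of class t

    def f1(tp_, fp_, fn_):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class weighs the same.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical labels for five regions (placeholders, not the dataset's names).
y_true = ["A", "A", "A", "A", "B", "B", "C", "D", "E", "E"]
y_pred = ["A", "A", "A", "B", "B", "A", "C", "E", "E", "E"]

macro_f1, micro_f1 = f1_scores(y_true, y_pred)
print(f"macro F1 = {macro_f1:.2%}, micro F1 = {micro_f1:.2%}")
```

In the single-label setting, micro F1 coincides with plain accuracy, while macro F1 drops when a rare class (here the hypothetical region "D") is never predicted correctly. A macro score below the micro score, as in the abstract, is consistent with some classes being harder than others.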

