We introduce RoDia, the first dataset for Romanian dialect identification from speech. RoDia comprises a diverse collection of speech samples from five distinct regions of Romania, covering both urban and rural environments, and totals 2 hours of manually annotated speech. Alongside the dataset, we introduce a set of competitive models to serve as baselines for future research. The top-scoring model achieves a macro F1 score of 59.83% and a micro F1 score of 62.08%, indicating that the task is challenging. We therefore believe that RoDia is a valuable resource that will stimulate research on the challenges of Romanian dialect identification. We release our dataset at https://github.com/codrut2/RoDia.
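The macro and micro F1 scores reported above weight classes differently, which is why the two numbers diverge. The following is a minimal sketch of how the two averages can be computed for a single-label, multi-class task such as dialect identification. The function name `f1_scores` and the region labels are hypothetical, chosen for illustration only; this is not the evaluation code used for RoDia.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[t] += 1          # t was the true class but was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical labels and predictions, not data from RoDia.
y_true = ["region_a"] * 4 + ["region_b"] * 2 + ["region_c"] * 2
y_pred = ["region_a"] * 4 + ["region_b", "region_a", "region_c", "region_a"]

macro, micro = f1_scores(y_true, y_pred)
print(f"macro F1 = {macro:.4f}, micro F1 = {micro:.4f}")
```

In the single-label setting, micro F1 coincides with accuracy, whereas macro F1 gives rare classes the same weight as frequent ones. In this toy example the two smaller classes have lower per-class F1 than the largest one, so macro F1 falls below micro F1, the same direction as the gap between the two reported scores.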