In this paper, we present a dataset for the computational study of a number of Modern Greek dialects. It consists of raw text data from four dialects of Modern Greek: Cretan, Pontic, Northern Greek, and Cypriot Greek. The dataset is of considerable size, albeit imbalanced, and represents the first attempt to create large-scale dialectal resources of this type for Modern Greek dialects. We then use the dataset to perform dialect identification, experimenting with traditional ML algorithms as well as simple DL architectures. The results show very good performance on the task, suggesting that the dialects in question have characteristics distinct enough for even simple ML models to perform well. Error analysis of the top-performing algorithms shows that in a number of cases the errors are due to insufficient dataset cleaning.