For natural language processing to make sound progress, we must be aware of the limitations of the evaluation metrics we use. In this work, we evaluate how robust metrics are to non-standardized dialects, i.e. spelling differences in language varieties that do not have a standard orthography. To investigate this, we collect a dataset of human translations and human judgments of automatic machine translations from English into two Swiss German dialects. We further create a challenge set for dialect variation and benchmark the performance of existing metrics. Our results show that existing metrics cannot reliably evaluate Swiss German text generation outputs, especially at the segment level. We propose initial design adaptations that increase robustness in the face of non-standardized dialects, although much room for improvement remains. The dataset, code, and models are available here: https://github.com/textshuttle/dialect_eval