Grammatical Error Correction (GEC) research tends to focus on major languages, with less attention given to low-resource languages like Esperanto. In this article, we begin to bridge this gap. We first conduct a comprehensive frequency analysis using the Eo-GP dataset, which was created explicitly for this purpose. We then introduce the Eo-GEC dataset, derived from authentic user cases and annotated with fine-grained linguistic details for error identification. In experiments with GPT-3.5 and GPT-4, GPT-4 outperforms GPT-3.5 in both automated and human evaluations. This result highlights GPT-4's efficacy in addressing Esperanto's grammatical peculiarities and illustrates the potential of advanced language models to enhance GEC strategies for less commonly studied languages.