Morpheme glossing is a critical task in automated language documentation and can greatly benefit downstream applications. While state-of-the-art glossing systems perform very well for languages with large amounts of existing data, it is more difficult to build useful models for low-resource languages. In this paper, we propose a taxonomic loss function that exploits morphological information to improve morphological glossing when data is scarce. We find that although this loss function does not outperform a standard loss function with regard to single-label prediction accuracy, it produces better predictions when the top-n predicted labels are considered. We suggest that this property makes the taxonomic loss function useful in a human-in-the-loop annotation setting.
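To make the distinction between single-label accuracy and top-n accuracy concrete, the following is a minimal sketch of how the two could be computed. It is not the evaluation code used in this work: the function names `top_n_labels` and `top_n_accuracy`, the gloss labels, and the scores are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence


def top_n_labels(scores: Dict[str, float], n: int) -> List[str]:
    """Return the n highest-scoring labels, best first."""
    return sorted(scores, key=lambda label: scores[label], reverse=True)[:n]


def top_n_accuracy(
    predictions: Sequence[Dict[str, float]],
    gold: Sequence[str],
    n: int,
) -> float:
    """Fraction of items whose gold label is among the top n predicted labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(
        1
        for scores, label in zip(predictions, gold)
        if label in top_n_labels(scores, n)
    )
    return hits / len(gold)


# Hypothetical model scores for four morphemes (illustrative values only).
predictions = [
    {"PST": 0.50, "PRS": 0.30, "PL": 0.20},
    {"PST": 0.40, "PRS": 0.35, "PL": 0.25},
    {"SG": 0.60, "PL": 0.30, "PST": 0.10},
    {"SG": 0.50, "PST": 0.30, "PL": 0.20},
]
gold = ["PST", "PRS", "PL", "PL"]

top1 = top_n_accuracy(predictions, gold, n=1)  # single-label accuracy
top2 = top_n_accuracy(predictions, gold, n=2)
top3 = top_n_accuracy(predictions, gold, n=3)
```

With n = 1 this reduces to ordinary single-label accuracy. For larger n, a prediction counts as correct whenever the gold gloss appears anywhere in the shortlist, which corresponds to a human-in-the-loop setting where an annotator picks from a few suggested labels.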