We investigate automatic interlinear glossing in low-resource settings. We augment a hard-attentional neural model with embedded translation information extracted from interlinear glossed text. The translations are encoded with large language models, specifically BERT and T5, and we introduce a character-level decoder for generating the glossed output. With these enhancements, our model improves on the previous state of the art by an average of 3.97 percentage points on datasets from the SIGMORPHON 2023 Shared Task on Interlinear Glossing. In a simulated ultra-low-resource setting, trained on as few as 100 sentences, our system achieves an average improvement of 9.78 percentage points over the plain hard-attentional baseline. These results highlight the critical role of translation information in boosting the system's performance, especially when little training data is available. Our findings suggest a promising avenue for the documentation and preservation of languages.