Interlinear glossed text (IGT) is the standard format for linguistic annotation in language documentation, but producing it manually is slow and costly. Automated glossing systems have improved substantially in recent years, yet adoption among field linguists remains limited. Existing tools are designed to be evaluated rather than used: they offer no interpretable path for correcting outputs or for feeding linguistic expertise back into model behavior. We present GlossAssist, a glossing tool built around the retrieval-based architecture of CWoMP (Contrastive Word-Morpheme Pre-training), which grounds predictions in a mutable lexicon of learned morpheme representations. Building on CWoMP, our system treats each annotator correction as an active learning signal that expands the lexicon and improves future predictions without retraining the model. In this paper, we describe our interface and argue that this feedback loop should be treated as a design requirement for NLP tools aimed at documentary linguists.
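To make the feedback loop concrete, the following is a minimal sketch of retrieval against a mutable lexicon. It is not the GlossAssist or CWoMP implementation: the `MutableLexicon` class, the `cosine` helper, the two-dimensional embeddings, and the gloss labels are all hypothetical, and a real system would obtain embeddings from a trained encoder. The sketch only illustrates how storing a correction as a new lexicon entry can change later predictions while the model itself stays fixed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class MutableLexicon:
    """Toy lexicon (hypothetical): a growing list of (embedding, gloss) entries."""

    def __init__(self):
        self.entries = []

    def add(self, embedding, gloss):
        """Store a new entry; no model parameters are updated."""
        self.entries.append((list(embedding), gloss))

    def predict(self, embedding):
        """Return (gloss, similarity) of the nearest entry, or (None, 0.0) if empty."""
        if not self.entries:
            return None, 0.0
        best = max(self.entries, key=lambda e: cosine(embedding, e[0]))
        return best[1], cosine(embedding, best[0])


# Seed the lexicon with two made-up entries.
lexicon = MutableLexicon()
lexicon.add([1.0, 0.0], "dog")
lexicon.add([0.0, 1.0], "PST")

# Hypothetical encoder output for a morpheme the lexicon has not seen.
query = [0.6, 0.8]
before, _ = lexicon.predict(query)

# The annotator corrects the gloss; the correction becomes a new entry.
lexicon.add(query, "run")
after, _ = lexicon.predict(query)

print(before, "->", after)
```

In this toy setting the prediction for the same embedding changes after the correction is stored, because retrieval now finds the annotator-supplied entry as the nearest neighbour. The encoder is never touched, which is the property the abstract describes as improving predictions without retraining.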