Language models and software tools are essential to the continuing vitality of lesser-used languages; however, the neural models that are currently popular require considerable training data, which is not normally available for such low-resource languages. This paper describes work in progress to construct a rule-based model of Gàidhlig morphology using data from Wiktionary, arguing that rule-based systems make effective use of limited sample data, support greater interpretability, and provide insights useful in the design of teaching materials. The use of SQL for querying the occurrence of different lexical patterns is investigated, and a declarative rule base is presented that allows Python utilities to derive inflected forms of Gàidhlig words. This functionality could be used, for example, to support educational tools that teach or explain language patterns, or to support higher-level tools such as rule-based dependency parsers. The approach adds value to the data already present in Wiktionary by adapting it to new use cases.
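As a rough illustration of the two ideas named in the abstract, the sketch below pairs an SQL query over a word list with a small declarative rule table that a Python helper interprets. It is not the paper's rule base or schema: the `lemma` table, the `RULES` dictionary, the function names `apply_rule` and `initial_counts`, and the sample words are all hypothetical, and the only rule encoded is a simplified orthographic form of lenition (an initial b, c, d, f, g, m, p, s or t takes a following h, except s before g, m, p or t). Words are assumed to be lowercase.

```python
import re
import sqlite3

# Hypothetical declarative rule base: each rule is plain data, a list of
# (pattern, replacement) pairs tried in order. A replacement of None means
# "this word is an exception, leave it unchanged".
RULES = {
    "lenite": [
        # s is not lenited before g, m, p or t
        (r"^s[gmpt]", None),
        # lenitable initial consonants take a following h in writing
        (r"^([bcdfgmpst])(?!h)", r"\1h"),
    ],
}


def apply_rule(rule_name, word):
    """Apply the first matching (pattern, replacement) pair of a named rule."""
    for pattern, replacement in RULES[rule_name]:
        if re.search(pattern, word):
            if replacement is None:
                return word
            return re.sub(pattern, replacement, word, count=1)
    # No pattern matched (for example, initial vowels or l, n, r)
    return word


def initial_counts(words):
    """Count words by initial letter with SQL, using a throwaway in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lemma (word TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO lemma VALUES (?)", [(w,) for w in words])
    rows = conn.execute(
        "SELECT substr(word, 1, 1) AS initial, COUNT(*) "
        "FROM lemma GROUP BY initial ORDER BY initial"
    ).fetchall()
    conn.close()
    return dict(rows)


# Illustrative sample only; a real word list would come from Wiktionary data.
SAMPLE_WORDS = ["cat", "bàta", "sgoil", "mòr", "taigh", "làmh", "caileag"]

if __name__ == "__main__":
    print(initial_counts(SAMPLE_WORDS))
    for w in SAMPLE_WORDS:
        print(w, "->", apply_rule("lenite", w))
```

Keeping the rules as data rather than as branching code is one way to make them inspectable and reusable, for instance by a teaching tool that wants to explain which pattern applied to a given word.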