Wordnets are indispensable tools for various natural language processing applications. Unfortunately, wordnets become outdated, and producing or updating them is slow and costly in time and resources. This problem is more severe for low-resource languages. This study proposes a method for word sense induction and synset induction that uses only two linguistic resources: an unlabeled corpus and a sentence-embedding-based language model. The resulting sense inventory and synonym sets can be used to create a wordnet automatically. We applied this method to a corpus of Filipino text. The sense inventory and synsets were evaluated by matching them against the sense inventory of the machine-translated Princeton WordNet and by comparing the synsets to the Filipino WordNet. This study empirically shows that 30% of the induced word senses are valid and 40% of the induced synsets are valid, of which 20% are novel synsets.
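The abstract does not spell out the induction algorithm, so the following is only an illustrative sketch of how word sense induction from sentence embeddings is commonly approached: the contexts in which a target word occurs are embedded, and similar contexts are grouped so that each group stands for one candidate sense. Everything here is an assumption made for illustration, not the method of the study: the function name `induce_senses`, the greedy centroid clustering, the cosine threshold of 0.8, and the toy 3-dimensional vectors that stand in for real sentence embeddings.

```python
import numpy as np


def induce_senses(embeddings, threshold=0.8):
    """Greedily cluster context vectors of one target word.

    Hypothetical helper: each vector joins the existing cluster whose
    centroid is most similar to it (cosine similarity >= threshold);
    otherwise it opens a new cluster. Each cluster is one candidate sense.
    """
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalise rows

    centroid_sums = []  # running sum of unit vectors per cluster
    labels = []
    for v in X:
        best, best_sim = -1, threshold
        for k, c in enumerate(centroid_sums):
            sim = float(v @ (c / np.linalg.norm(c)))
            if sim >= best_sim:
                best, best_sim = k, sim
        if best == -1:
            centroid_sums.append(v.copy())
            labels.append(len(centroid_sums) - 1)
        else:
            centroid_sums[best] += v
            labels.append(best)
    return labels


# Toy stand-ins for sentence embeddings of four contexts of one word
# (assumed values; a real pipeline would obtain them from a language model).
contexts = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
]
labels = induce_senses(contexts)
print(labels)  # [0, 0, 1, 1]
```

In this toy run the first two contexts fall into one induced sense and the last two into another. Under the same assumptions, a synset could then be formed by grouping the sense centroids of different words that lie close together, although the abstract gives no detail on how the study performs that step.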