High-quality wordnets are crucial for achieving high-quality results in NLP applications that rely on such resources. However, the wordnets of most languages suffer from serious problems of correctness and completeness in the words and word meanings they define, such as incorrect lemmas, missing glosses and example sentences, or an inadequate, Western-centric representation of the morphology and semantics of the language. Previous efforts have largely focused on increasing lexical coverage while ignoring other qualitative aspects. In this paper, we focus on Arabic and introduce a major revision of the Arabic WordNet that addresses multiple dimensions of lexico-semantic resource quality. In this revision, we updated more than 58% of the synsets of the existing Arabic WordNet by adding missing information and correcting errors. To address issues of language diversity and untranslatability, we also extended the wordnet structure with new elements: phrasets and lexical gaps.