Lexical-semantic resources (LSRs), such as online lexicons or wordnets, are fundamental for natural language processing applications. In many languages, however, such resources suffer from quality issues: incorrect entries, incompleteness, and the rarely addressed issue of bias towards the English language and Anglo-Saxon culture. Such bias manifests itself in the absence of concepts specific to the language or culture at hand, in the presence of foreign (Anglo-Saxon) concepts, and in the lack of an explicit indication of untranslatability, also known as cross-lingual \emph{lexical gaps}, which arise when a term has no equivalent in another language. This paper proposes a novel crowdsourcing methodology for reducing bias in LSRs. Crowd workers compare lexemes from two languages, focusing on domains rich in lexical diversity, such as kinship or food. Our LingoGap crowdsourcing tool facilitates these comparisons through microtasks that identify equivalent terms, language-specific terms, and lexical gaps across languages. We validated our method in two case studies focused on food-related terminology: (1) English and Arabic, and (2) Standard Indonesian and Banjarese. These experiments identified 2,140 lexical gaps in the first case study and 951 in the second. The success of these experiments confirmed the usability of our method and tool for future large-scale lexicon enrichment tasks.