Polysemy and synonymy are two crucial, interrelated facets of lexical ambiguity. While both phenomena have been studied extensively in NLP, leading to dedicated systems, they have often been considered independently. While many tasks dealing with polysemy (e.g. Word Sense Disambiguation or Induction) highlight the role of a word's senses, the study of synonymy is rooted in the study of concepts, i.e. meaning shared across the lexicon. In this paper, we introduce Concept Induction, the unsupervised task of learning a soft clustering of words that defines a set of concepts directly from data. This task generalizes that of Word Sense Induction. We propose a bi-level approach to Concept Induction that leverages both a local lemma-centric view and a global cross-lexicon perspective to induce concepts. We evaluate the resulting clustering on SemCor's annotated data and obtain good performance (BCubed F1 above 0.60). We find that the local and global levels are mutually beneficial for inducing both concepts and senses in our setting. Finally, we create static embeddings representing our induced concepts and use them on the Word-in-Context task, obtaining performance competitive with the state of the art.