Polysemy and synonymy are two crucial, interrelated facets of lexical ambiguity. While both phenomena are widely documented in lexical resources and have been studied extensively in NLP, leading to dedicated systems, they are often considered independently in practical problems. While many tasks dealing with polysemy (e.g. Word Sense Disambiguation or Induction) highlight the role of a word's senses, the study of synonymy is rooted in the study of concepts, i.e. meanings shared across the lexicon. In this paper, we introduce Concept Induction, the unsupervised task of learning a soft clustering of words that defines a set of concepts directly from data. This task generalizes Word Sense Induction. We propose a bi-level approach to Concept Induction that leverages both a local lemma-centric view and a global cross-lexicon view to induce concepts. We evaluate the resulting clustering on SemCor's annotated data and obtain good performance (BCubed F1 above 0.60). We find that the local and global levels are mutually beneficial for inducing both concepts and senses in our setting. Finally, we build static embeddings representing our induced concepts and use them on the Word-in-Context task, obtaining performance competitive with the state of the art.
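For concreteness, the BCubed scores used in the evaluation above can be sketched as follows. This is a minimal illustration for hard clusterings only (the paper's soft/overlapping setting requires the extended BCubed variant); the input dictionaries and cluster ids are invented for the example:

```python
from collections import defaultdict

def bcubed_f1(pred, gold):
    """BCubed precision/recall/F1 for hard clusterings.

    pred, gold: dicts mapping each item to a single cluster id.
    (Soft clusterings need the extended BCubed definition instead.)
    """
    # Invert the clusterings: cluster id -> set of items.
    pred_clusters = defaultdict(set)
    gold_clusters = defaultdict(set)
    for item, c in pred.items():
        pred_clusters[c].add(item)
    for item, c in gold.items():
        gold_clusters[c].add(item)

    precision = recall = 0.0
    items = list(pred)
    for item in items:
        p_cluster = pred_clusters[pred[item]]   # predicted cluster of this item
        g_cluster = gold_clusters[gold[item]]   # gold cluster of this item
        overlap = len(p_cluster & g_cluster)
        precision += overlap / len(p_cluster)   # item-level BCubed precision
        recall += overlap / len(g_cluster)      # item-level BCubed recall
    precision /= len(items)
    recall /= len(items)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: the system splits one gold cluster in two.
pred = {"run": 1, "sprint": 1, "jog": 2}
gold = {"run": 1, "sprint": 1, "jog": 1}
print(bcubed_f1(pred, gold))  # precision 1.0, recall 5/9, F1 = 10/14
```

Averaging per-item precision and recall (rather than per-cluster) is what makes BCubed robust to clusterings with many small or unbalanced clusters.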