Languages are dynamic entities in which the meanings associated with words constantly change over time. Detecting the semantic variation of words is an important task for various NLP applications that must make time-sensitive predictions. Existing work on semantic variation prediction has predominantly focused on comparing some form of averaged contextualised representation of a target word computed from a given corpus. However, some of the previously associated meanings of a target word can become obsolete over time (e.g. the meaning of "gay" as "happy"), while novel usages of existing words emerge (e.g. the meaning of "cell" as "mobile phone"). We argue that mean representations alone cannot accurately capture such semantic variations and propose a method that uses the entire cohort of the contextualised embeddings of the target word, which we refer to as the sibling distribution. Experimental results on the SemEval-2020 Task 1 benchmark dataset for semantic variation prediction show that our method outperforms prior work that considers only the mean embeddings, and is comparable to the current state-of-the-art. Moreover, a qualitative analysis shows that our method detects important semantic changes in words that are not captured by existing methods. Source code is available at https://github.com/a1da4/svp-gauss.
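To illustrate why a mean representation can miss a change that the full set of sibling embeddings reveals, the following is a minimal sketch on synthetic data. It is not the released implementation: the diagonal Gaussian fit, the symmetrised KL (Jeffreys) divergence, the synthetic embeddings, and all function names are assumptions chosen for illustration only.

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance between two vectors."""
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def fit_diag_gaussian(X, eps=1e-6):
    """Fit a diagonal Gaussian (per-dimension mean and variance) to rows of X."""
    return X.mean(axis=0), X.var(axis=0) + eps


def kl_diag(mu0, var0, mu1, var1):
    """KL divergence KL(N0 || N1) between two diagonal Gaussians."""
    return 0.5 * float(
        np.sum(np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)
    )


def jeffreys_divergence(X1, X2):
    """Symmetrised KL divergence between Gaussians fitted to two embedding sets."""
    mu1, var1 = fit_diag_gaussian(X1)
    mu2, var2 = fit_diag_gaussian(X2)
    return kl_diag(mu1, var1, mu2, var2) + kl_diag(mu2, var2, mu1, var1)


# Synthetic stand-ins for contextualised embeddings of one target word.
rng = np.random.default_rng(0)
dim = 8
centre = np.ones(dim)

# Earlier corpus: a single tight cluster (one dominant usage).
X_old = centre + 0.1 * rng.standard_normal((200, dim))

# Later corpus: two clusters placed symmetrically around the same centre,
# so the mean barely moves although the usage pattern has split.
offset = np.zeros(dim)
offset[0] = 2.0
X_new = np.vstack(
    [
        centre + offset + 0.1 * rng.standard_normal((100, dim)),
        centre - offset + 0.1 * rng.standard_normal((100, dim)),
    ]
)

mean_score = cosine_distance(X_old.mean(axis=0), X_new.mean(axis=0))
dist_score = jeffreys_divergence(X_old, X_new)

print(f"cosine distance between mean embeddings: {mean_score:.6f}")
print(f"Jeffreys divergence between fitted Gaussians: {dist_score:.2f}")
```

In this toy setting the two mean vectors nearly coincide, so a mean-based score is close to zero, while the divergence between the fitted distributions is large because the spread of the embeddings has changed.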