Contextualized embeddings are the preferred tool for modeling Lexical Semantic Change (LSC). Current evaluations typically focus on a specific task known as Graded Change Detection (GCD). However, performance comparisons across works are often misleading due to their reliance on diverse experimental settings. In this paper, we evaluate state-of-the-art models and approaches for GCD under equal conditions. We further break the LSC problem down into the Word-in-Context (WiC) and Word Sense Induction (WSI) tasks, and compare models across these different levels. Our evaluation spans multiple languages on eight available LSC benchmarks, and shows that (i) APD outperforms other approaches for GCD; (ii) XL-LEXEME outperforms other contextualized models for WiC, WSI, and GCD, while being comparable to GPT-4; and (iii) there is a clear need to improve the modeling of word meanings and to focus on how, when, and why these meanings change, rather than solely on the extent of semantic change.