Contextualized embeddings are the preferred tool for modeling Lexical Semantic Change (LSC). Current evaluations typically focus on a single task, Graded Change Detection (GCD). However, performance comparisons across studies are often misleading because the studies rely on different experimental settings. In this paper, we evaluate state-of-the-art models and approaches for GCD under equal conditions. We further break the LSC problem into Word-in-Context (WiC) and Word Sense Induction (WSI) tasks, and compare models at each of these levels. Our evaluation covers eight available LSC benchmarks in different languages and shows that (i) APD (Average Pairwise Distance) outperforms other approaches for GCD; (ii) XL-LEXEME outperforms other contextualized models for WiC, WSI, and GCD, while being comparable to GPT-4; and (iii) there is a clear need to improve the modeling of word meanings and to focus on how, when, and why these meanings change, rather than solely on the extent of semantic change.
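
As an illustration of the APD approach named in finding (i), the following is a minimal sketch of Average Pairwise Distance as it is commonly described in the LSC literature: the mean distance over all pairs of usage embeddings of a target word, one drawn from each time period. The abstract does not give a formula, so the use of cosine distance, the function name `average_pairwise_distance`, and the toy vectors are assumptions made for illustration only, not the authors' implementation.

```python
import numpy as np


def average_pairwise_distance(emb_t1, emb_t2):
    """Mean cosine distance over all cross-period pairs of usage embeddings.

    emb_t1, emb_t2: array-likes of shape (n_usages, dim), one row per
    contextualized embedding of the target word in each time period.
    Hypothetical helper; cosine distance is an assumed choice of metric.
    """
    a = np.asarray(emb_t1, dtype=float)
    b = np.asarray(emb_t2, dtype=float)
    # L2-normalize rows so the dot product equals cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    # Matrix of cosine distances between every period-1 and period-2 usage.
    cosine_distances = 1.0 - a @ b.T
    return float(cosine_distances.mean())


# Toy vectors standing in for model output (not real data).
period_1 = [[1.0, 0.0], [1.0, 0.0]]
period_2 = [[1.0, 0.0], [0.0, 1.0]]
score = average_pairwise_distance(period_1, period_2)
```

Under these assumptions, a higher score indicates that usages from the two periods are further apart on average, which GCD benchmarks then compare against human-annotated change rankings.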