Contextualized embeddings are the preferred tool for modeling Lexical Semantic Change (LSC). Current evaluations typically focus on a specific task known as Graded Change Detection (GCD). However, performance comparisons across studies are often misleading because they rely on diverse settings. In this paper, we evaluate state-of-the-art models and approaches for GCD under equal conditions. We further break the LSC problem into Word-in-Context (WiC) and Word Sense Induction (WSI) tasks, and compare models across these different levels. Our evaluation is performed across different languages on eight available benchmarks for LSC, and shows that (i) Average Pairwise Distance (APD) outperforms other approaches for GCD; (ii) XL-LEXEME outperforms other contextualized models for WiC, WSI, and GCD, while being comparable to GPT-4; (iii) there is a clear need to improve the modeling of word meanings and to focus on how, when, and why these meanings change, rather than solely on the extent of semantic change.
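For readers unfamiliar with APD, the following is a minimal sketch of how an average pairwise distance score is commonly computed for GCD: the mean cosine distance over all pairs of contextualized embeddings of a target word, one drawn from each time period. The function name, the input shapes, and the choice of cosine distance are assumptions made for illustration; the abstract does not specify the exact configuration evaluated in the paper.

```python
import numpy as np

def average_pairwise_distance(emb_t1, emb_t2):
    """Mean cosine distance over all cross-period embedding pairs.

    emb_t1, emb_t2: arrays of shape (n_usages, dim) holding the
    contextualized embeddings of one target word in each time period.
    (Hypothetical interface; cosine distance is an assumption.)
    """
    a = np.asarray(emb_t1, dtype=float)
    b = np.asarray(emb_t2, dtype=float)
    # Normalize rows to unit length so the dot product is cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    # Entry (i, j) is the cosine distance between usage i in period 1
    # and usage j in period 2.
    cosine_distance = 1.0 - a @ b.T
    return float(cosine_distance.mean())

# Toy usage: embeddings pointing in orthogonal directions across periods.
period_1 = [[1.0, 0.0], [1.0, 0.0]]
period_2 = [[0.0, 1.0], [0.0, 2.0]]
score = average_pairwise_distance(period_1, period_2)
```

A higher score indicates that usages from the two periods are, on average, further apart in embedding space, which is read as a larger degree of semantic change.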