$\textit{Swap and Predict}$ -- Predicting the Semantic Changes in Words across Corpora by Context Swapping

Meanings of words change over time and across domains. Detecting the semantic changes of words is an important task for various NLP applications that must make time-sensitive predictions. We consider the problem of predicting whether a given target word, $w$, changes its meaning between two different text corpora, $\mathcal{C}_1$ and $\mathcal{C}_2$. For this purpose, we propose $\textit{Swapping-based Semantic Change Detection}$ (SSCD), an unsupervised method that randomly swaps contexts between $\mathcal{C}_1$ and $\mathcal{C}_2$ where $w$ occurs. We then look at the distribution of contextualised word embeddings of $w$, obtained from a pretrained masked language model (MLM), representing the meaning of $w$ in its occurrence contexts in $\mathcal{C}_1$ and $\mathcal{C}_2$. Intuitively, if the meaning of $w$ does not change between $\mathcal{C}_1$ and $\mathcal{C}_2$, we would expect the distributions of contextualised word embeddings of $w$ to remain the same before and after this random swapping process. Despite its simplicity, we demonstrate that even by using pretrained MLMs without any fine-tuning, our proposed context swapping method accurately predicts the semantic changes of words in four languages (English, German, Swedish, and Latin) and across different time spans (over 50 years and about five years). Moreover, our method achieves significant performance improvements compared to strong baselines for the English semantic change prediction task. Source code is available at https://github.com/a1da4/svp-swap .
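
The swapping idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that). It assumes the contextualised embeddings of $w$ have already been extracted from a frozen MLM into two arrays, so that swapping contexts between $\mathcal{C}_1$ and $\mathcal{C}_2$ amounts to swapping rows between the arrays. The function names, the `swap_rate` and `n_trials` parameters, and the choice of cosine distance between mean embeddings as the distributional distance are all hypothetical choices made for this example.

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance between two vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def mean_distance(emb1, emb2):
    """Distance between two embedding sets, summarised by their mean vectors.

    Assumption: any distributional distance could be used here; the mean-vector
    cosine distance is only a simple stand-in for illustration.
    """
    return cosine_distance(emb1.mean(axis=0), emb2.mean(axis=0))


def swap_contexts(emb1, emb2, swap_rate, rng):
    """Randomly exchange a fraction of rows (contexts) between the two sets."""
    n_swap = int(swap_rate * min(len(emb1), len(emb2)))
    idx1 = rng.choice(len(emb1), size=n_swap, replace=False)
    idx2 = rng.choice(len(emb2), size=n_swap, replace=False)
    swapped1, swapped2 = emb1.copy(), emb2.copy()
    # The right-hand side reads from the original arrays, so the exchange is clean.
    swapped1[idx1], swapped2[idx2] = emb2[idx2], emb1[idx1]
    return swapped1, swapped2


def swap_change_score(emb1, emb2, swap_rate=0.5, n_trials=20, seed=0):
    """Average absolute change in distance caused by random context swapping.

    If the meaning of the word is stable, swapping should barely move the
    distance; a large shift suggests the two corpora use the word differently.
    """
    rng = np.random.default_rng(seed)
    before = mean_distance(emb1, emb2)
    shifts = []
    for _ in range(n_trials):
        swapped1, swapped2 = swap_contexts(emb1, emb2, swap_rate, rng)
        shifts.append(abs(before - mean_distance(swapped1, swapped2)))
    return float(np.mean(shifts))
```

Given two arrays of shape `(n_contexts, dim)`, `swap_change_score(emb1, emb2)` returns a non-negative score. Under the assumptions above, words whose embeddings come from the same distribution in both corpora should score close to zero, while words whose usage differs should score higher.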
