Automatically highlighting words that cause semantic differences between two documents could be useful for a wide range of applications. We formulate recognizing semantic differences (RSD) as a token-level regression task and study three unsupervised approaches that rely on a masked language model. To assess the approaches, we begin with basic English sentences and gradually move to more complex, cross-lingual document pairs. Our results show that an approach based on word alignment and sentence-level contrastive learning has a robust correlation to gold labels. However, all unsupervised approaches still leave a large margin of improvement. Code to reproduce our experiments is available at https://github.com/ZurichNLP/recognizing-semantic-differences
翻译:自动高亮导致两篇文档间语义差异的词汇,对广泛的应用场景具有潜在价值。本文将语义差异识别(RSD)建模为词级回归任务,并研究了三种基于掩码语言模型的无监督方法。为评估这些方法,我们从基础英语句子出发,逐步过渡到更复杂的跨语言文档对。实验结果表明,基于词对齐与句子级对比学习的方法与黄金标签具有稳健的相关性。然而,所有无监督方法仍存在较大的改进空间。重现实验的代码托管于 https://github.com/ZurichNLP/recognizing-semantic-differences