Automatically highlighting the words that cause semantic differences between two documents could be useful for a wide range of applications. We formulate recognizing semantic differences (RSD) as a token-level regression task and study three unsupervised approaches that rely on a masked language model. To assess the approaches, we begin with basic English sentences and gradually move to more complex, cross-lingual document pairs. Our results show that an approach based on word alignment and sentence-level contrastive learning correlates robustly with gold labels. However, all unsupervised approaches still leave a large margin for improvement. Code to reproduce our experiments is available at https://github.com/ZurichNLP/recognizing-semantic-differences
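To make the token-level formulation concrete, the following is a minimal sketch of an alignment-style difference score. It assumes that each token already has an embedding vector, for example taken from a masked language model encoder. The function name `alignment_difference_scores`, the toy two-dimensional vectors, and the scoring rule (one minus the highest cosine similarity to any token in the other document) are illustrative assumptions and may differ from the implementation in the linked repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def alignment_difference_scores(tokens_a, tokens_b):
    """Assign each token of document A a difference score.

    A token scores close to 0 if some token in document B has a very
    similar embedding, and scores higher the worse its best match is.
    This is an assumed scoring rule for illustration only.
    """
    if not tokens_b:
        # Nothing to align with: treat every token as fully different.
        return [1.0 for _ in tokens_a]
    return [1.0 - max(cosine(a, b) for b in tokens_b) for a in tokens_a]


# Toy embeddings (hypothetical); a real setup would use encoder outputs.
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
doc_b = [[1.0, 0.0], [1.0, 1.0]]

scores = alignment_difference_scores(doc_a, doc_b)
for index, score in enumerate(scores):
    print(f"token {index}: difference score {score:.3f}")
```

In this toy input, the second token of `doc_a` has no exact counterpart in `doc_b` and therefore receives the highest score. Because the output is one real-valued score per token, predictions of this kind can be compared to gold labels with a correlation measure, which matches the regression framing described above.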