Automatically highlighting the words that cause semantic differences between two documents could be useful for a wide range of applications. We formulate recognizing semantic differences (RSD) as a token-level regression task and study three unsupervised approaches that rely on a masked language model. To assess the approaches, we begin with basic English sentences and gradually move to more complex, cross-lingual document pairs. Our results show that an approach based on word alignment and sentence-level contrastive learning correlates robustly with gold labels. However, all unsupervised approaches still leave considerable room for improvement. Code to reproduce our experiments is available at https://github.com/ZurichNLP/recognizing-semantic-differences
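To make the token-level formulation concrete, the following is a minimal sketch of one way an alignment-style difference score could be computed. It is an illustration under stated assumptions, not the implementation from the linked repository: the function name `difference_scores` is hypothetical, the embeddings are hand-made toy vectors rather than outputs of a masked language model, and the scoring rule (one minus the highest cosine similarity to any token in the other document) is a simple stand-in for a word-alignment approach.

```python
import numpy as np


def difference_scores(emb_a: np.ndarray, emb_b: np.ndarray) -> np.ndarray:
    """Score each token of document A by how poorly it aligns with document B.

    Hypothetical helper: the score is 1 minus the highest cosine similarity
    between the token and any token of B, so well-aligned tokens score near 0.
    """
    # Normalize rows so that dot products equal cosine similarities.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    similarity = a @ b.T  # shape: (len(A), len(B))
    return 1.0 - similarity.max(axis=1)


# Toy token embeddings (assumed values; a real system would take these
# from an encoder such as a masked language model).
tokens_a = ["the", "cat", "sleeps"]
tokens_b = ["the", "dog", "sleeps"]
emb_a = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
emb_b = np.array([[1.0, 0.0, 0.0], [0.0, 0.6, 0.8], [0.0, 0.0, 1.0]])

scores = difference_scores(emb_a, emb_b)
for token, score in zip(tokens_a, scores):
    print(f"{token}: {score:.2f}")
```

In this toy case only "cat" receives a nonzero score, since it has no exact counterpart in the second sentence. Under the regression framing, such per-token scores would then be compared with gold labels by correlation.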