The paraphrase identification task involves measuring semantic similarity between two short sentences. It is a difficult task, and multilingual paraphrase identification is even more challenging. In this work, we train a bi-encoder model in a contrastive manner to detect hard paraphrases across multiple languages. This approach lets us reuse the model-produced embeddings for other tasks, such as semantic search. We evaluate our model on downstream tasks and also assess the quality of its embedding space. Its performance is comparable to that of state-of-the-art cross-encoders, with only a minimal relative drop of 7-10% on the chosen dataset, while maintaining decent embedding quality.
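One common way to train a bi-encoder contrastively is with an in-batch (InfoNCE-style) objective, where each paraphrase pair supplies the positive and the other pairs in the batch act as negatives. The sketch below illustrates this idea in NumPy; the temperature value, toy embeddings, and function names are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-9):
    """Normalize rows to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def info_nce_loss(emb_a, emb_b, temperature=0.05):
    """In-batch contrastive loss for a bi-encoder (illustrative sketch).

    emb_a[i] and emb_b[i] embed the two sides of a paraphrase pair
    (the positive); every other row in the batch serves as a negative.
    """
    a = l2_normalize(emb_a)
    b = l2_normalize(emb_b)
    logits = a @ b.T / temperature  # (batch, batch) cosine-similarity matrix
    # Softmax cross-entropy with the correct pair on the diagonal.
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


# Toy batch: 4 hypothetical paraphrase pairs with 8-dimensional embeddings.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(4, 8))
emb_b = emb_a + 0.1 * rng.normal(size=(4, 8))  # paraphrases embed nearby
loss = info_nce_loss(emb_a, emb_b)
```

Minimizing this loss pulls each pair's embeddings together and pushes the rest of the batch apart, which is what makes the resulting embeddings directly usable for semantic search.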