PESTS: Persian_English Cross Lingual Corpus for Semantic Textual Similarity

Semantic textual similarity is a component of natural language processing that has received considerable attention recently. In computational linguistics and natural language processing, assessing the semantic similarity of words, phrases, paragraphs, and texts is crucial. Semantic textual similarity is the task of computing the degree of semantic resemblance between two text segments, such as phrases, sentences, or paragraphs, in either a monolingual or a cross-lingual setting. The cross-lingual setting requires corpora of sentence pairs in which one sentence is in the source language and the other is in the target language, with a degree of semantic similarity between them. Because such datasets are unavailable, many existing cross-lingual semantic similarity models rely on machine translation, and the propagation of machine translation errors reduces their accuracy. Moreover, when semantic similarity features are to be used for machine translation, the same machine translation should not be used to compute the semantic similarity. For Persian, a low-resource language, no such effort has been made, and the need for a model that can understand the context of both languages is felt more than ever. In this article, we present the first corpus of semantic textual similarity between Persian and English sentences, produced with the help of linguistic experts. We name this dataset PESTS (Persian English Semantic Textual Similarity). The corpus contains 5375 sentence pairs. We also fine-tune several transformer-based models on this dataset. The results show that fine-tuning on PESTS increases the Pearson correlation of the XLM-RoBERTa model from 85.87% to 95.62%.
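The Pearson correlation reported above measures how closely a model's predicted similarity scores track the human-assigned scores. The following is a minimal sketch of that metric, not the authors' evaluation code. The function name `pearson` and the toy score lists are hypothetical and are not taken from PESTS. The abstract quotes the correlation as a percentage, which presumably corresponds to the coefficient multiplied by 100.

```python
import math


def pearson(gold, predicted):
    """Return the Pearson correlation coefficient between two score lists."""
    if len(gold) != len(predicted) or len(gold) < 2:
        raise ValueError("need two equally sized lists with at least 2 items")

    mean_g = sum(gold) / len(gold)
    mean_p = sum(predicted) / len(predicted)

    # Covariance and variances (unnormalised; the 1/n factors cancel out).
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(gold, predicted))
    var_g = sum((g - mean_g) ** 2 for g in gold)
    var_p = sum((p - mean_p) ** 2 for p in predicted)

    if var_g == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant scores")

    return cov / math.sqrt(var_g * var_p)


if __name__ == "__main__":
    # Hypothetical toy values for illustration only (not PESTS data).
    gold_scores = [4.5, 1.0, 3.0, 0.5, 2.5]
    model_scores = [4.1, 1.4, 2.7, 0.9, 2.2]
    r = pearson(gold_scores, model_scores)
    print(f"Pearson r = {r:.4f} ({100 * r:.2f}%)")
```

Because the coefficient is invariant to linear rescaling, the predicted scores (for example, cosine similarities between sentence embeddings) do not need to be on the same scale as the gold annotations.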
