This work presents a novel framework for training Arabic nested embedding models through Matryoshka Embedding Learning, leveraging multilingual, Arabic-specific, and English-based models to highlight the power of nested embedding models in various Arabic NLP downstream tasks. Our contribution includes translating several sentence-similarity datasets into Arabic, enabling a comprehensive evaluation framework for comparing these models across different embedding dimensions. We trained several nested embedding models on an Arabic natural language inference triplet dataset and assessed their performance using multiple evaluation metrics, including Pearson and Spearman correlations for cosine similarity, Manhattan distance, Euclidean distance, and dot-product similarity. The results demonstrate the superior performance of Arabic Matryoshka embedding models, particularly in capturing semantic nuances unique to the Arabic language, outperforming traditional models by up to 20-25\% across the various similarity metrics. These results underscore the effectiveness of language-specific training and highlight the potential of Matryoshka models in enhancing semantic textual similarity tasks for Arabic NLP.
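To make the evaluation idea concrete, the sketch below shows the core property of nested (Matryoshka) embeddings: a full embedding can be truncated to a smaller prefix dimension, renormalized, and still scored with cosine similarity, whose agreement with gold similarity scores is then measured by Spearman rank correlation. All vectors, dimensions, and scores here are toy values invented for illustration; this is not the paper's actual model or dataset.

```python
import math

def truncate_normalize(vec, dim):
    """Matryoshka truncation: keep the first `dim` components, then L2-normalize."""
    v = vec[:dim]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity; for unit vectors this is just the dot product."""
    return sum(x * y for x, y in zip(a, b))

def spearman(xs, ys):
    """Spearman rank correlation (simple version, assumes no tied values)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

if __name__ == "__main__":
    # Toy 8-dim "sentence embeddings" for three pairs, with invented gold scores.
    pairs = [
        ([0.9, 0.1, 0.2, 0.0, 0.1, 0.0, 0.0, 0.1],
         [0.8, 0.2, 0.1, 0.1, 0.0, 0.1, 0.0, 0.0], 0.9),
        ([0.1, 0.9, 0.0, 0.2, 0.0, 0.1, 0.1, 0.0],
         [0.2, 0.7, 0.1, 0.1, 0.1, 0.0, 0.0, 0.1], 0.7),
        ([0.0, 0.1, 0.9, 0.0, 0.2, 0.0, 0.1, 0.0],
         [0.8, 0.1, 0.0, 0.1, 0.0, 0.2, 0.0, 0.1], 0.1),
    ]
    gold = [g for _, _, g in pairs]
    # Evaluate at the full dimension and at a truncated Matryoshka dimension.
    for dim in (8, 4):
        sims = [cosine(truncate_normalize(a, dim), truncate_normalize(b, dim))
                for a, b, _ in pairs]
        print(f"dim={dim}  spearman={spearman(sims, gold):.3f}")
```

In practice such truncated prefixes trade a small drop in correlation for large savings in storage and similarity-search cost, which is the motivation for training the nested models evaluated in this work.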