In this paper, we introduce the Polish Massive Text Embedding Benchmark (PL-MTEB), a comprehensive benchmark for text embeddings in the Polish language. PL-MTEB comprises 30 diverse NLP tasks across five categories: classification, clustering, pair classification, information retrieval, and semantic textual similarity. As part of this work, we added 12 new Polish-language tasks to MTEB based on existing datasets and prepared two new datasets, which we used to create four clustering tasks. We evaluated 30 publicly available text embedding models, both Polish and multilingual, and analyzed the results in detail by task type and model size. The prepared datasets, the evaluation source code, and the obtained results are publicly available at https://github.com/rafalposwiata/pl-mteb.