In this paper, we introduce the Polish Massive Text Embedding Benchmark (PL-MTEB), a comprehensive benchmark for text embeddings in Polish. PL-MTEB consists of 28 diverse NLP tasks spanning 5 task types. We adapted the tasks from datasets previously used by the Polish NLP community. In addition, we created a new dataset, PLSC (Polish Library of Science Corpus), consisting of titles and abstracts of scientific publications in Polish, which served as the basis for two novel clustering tasks. We evaluated 15 publicly available text embedding models, both Polish and multilingual, and collected detailed results for individual tasks as well as aggregated results for each task type and for the benchmark as a whole. PL-MTEB comes with open-source code at https://github.com/rafalposwiata/pl-mteb.