We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types -- nearly 4$\times$ the depth of existing multilingual benchmark coverage for Slovak. Our evaluation of 31 embedding models reveals that large instruction-tuned multilingual models achieve the strongest performance, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To address the need for efficient, locally deployable Slovak embeddings, we develop \texttt{e5-sk-small} (45M parameters) and \texttt{e5-sk-large} (365M parameters) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. Despite size reductions of up to 62\%, our open-source models perform competitively with proprietary APIs while remaining locally deployable for semantic search and retrieval-augmented generation (RAG). We release the benchmark, models, datasets, and code openly, in the hope that our approach offers a replicable path for other under-resourced languages.
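The abstract does not spell out the vocabulary-trimming procedure, so the following is only a minimal sketch of the general idea under stated assumptions: keep the tokens that actually occur in a target-language corpus (plus special tokens), slice the embedding matrix to those rows, and remap token ids. The function `trim_vocabulary`, its parameters, and the toy vocabulary are hypothetical illustrations and are not taken from the released code.

```python
from collections import Counter


def trim_vocabulary(vocab, embeddings, corpus_token_ids, special_tokens, min_count=1):
    """Toy vocabulary trimming (hypothetical helper, not the authors' code).

    vocab: dict mapping token string -> old id
    embeddings: list of embedding rows, indexed by old id
    corpus_token_ids: old ids obtained by tokenizing a target-language corpus
    special_tokens: set of token strings that are always kept
    Returns (new_vocab, new_embeddings, old_to_new).
    """
    counts = Counter(corpus_token_ids)
    id_to_token = {idx: tok for tok, idx in vocab.items()}

    # Keep special tokens and every token seen at least min_count times.
    keep_ids = sorted(
        idx
        for idx, tok in id_to_token.items()
        if tok in special_tokens or counts[idx] >= min_count
    )

    # Assign new contiguous ids and carry over the matching embedding rows.
    old_to_new = {old: new for new, old in enumerate(keep_ids)}
    new_vocab = {id_to_token[old]: new for old, new in old_to_new.items()}
    new_embeddings = [embeddings[old] for old in keep_ids]
    return new_vocab, new_embeddings, old_to_new


# Assumed toy data: a six-token "multilingual" vocabulary with 2-d embeddings.
vocab = {"<pad>": 0, "<unk>": 1, "▁hello": 2, "▁dobrý": 3, "▁世界": 4, "▁deň": 5}
embeddings = [[float(i), float(i) + 0.5] for i in range(len(vocab))]
corpus_ids = [3, 5, 3, 5, 5]  # ids seen in a pretend Slovak corpus

new_vocab, new_emb, mapping = trim_vocabulary(
    vocab, embeddings, corpus_ids, special_tokens={"<pad>", "<unk>"}
)
reduction = 1 - len(new_emb) / len(embeddings)
print(new_vocab)
print(f"embedding rows removed: {reduction:.0%}")
```

In a real multilingual encoder most parameters of a small model sit in the token-embedding matrix, which is why dropping unused rows can shrink the model substantially; the fine-tuning step mentioned in the abstract is not covered by this sketch.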