SlovKE: A Large-Scale Dataset and LLM Evaluation for Slovak Keyphrase Extraction

Keyphrase extraction for morphologically rich, low-resource languages remains understudied, largely due to the scarcity of suitable evaluation datasets. We address this gap for Slovak by constructing a dataset of 227,432 scientific abstracts with author-assigned keyphrases -- scraped and systematically cleaned from the Slovak Central Register of Theses -- representing a 25-fold increase over the largest prior Slovak resource and approaching the scale of established English benchmarks such as KP20K. Using this dataset, we benchmark three unsupervised baselines (YAKE, TextRank, KeyBERT with SlovakBERT embeddings) and evaluate KeyLLM, an LLM-based extraction method using GPT-3.5-turbo. Unsupervised baselines achieve at most 11.6\% exact-match $F_1@6$, with a large gap to partial matching (up to 51.5\%), reflecting the difficulty of matching inflected surface forms to author-assigned keyphrases. KeyLLM narrows this exact--partial gap, producing keyphrases closer to the canonical forms assigned by authors, while manual evaluation on 100 documents ($\kappa = 0.61$) confirms that KeyLLM captures relevant concepts that automated exact matching underestimates. Our analysis identifies morphological mismatch as the dominant failure mode for statistical methods -- a finding relevant to other inflected languages. The dataset (https://huggingface.co/datasets/NaiveNeuron/SlovKE) and evaluation code (https://github.com/NaiveNeuron/SlovKE) are publicly available.
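
To make the exact-versus-partial gap concrete, the following is a minimal sketch of an $F_1@k$ computation under two matching criteria. It is not the authors' evaluation code: the function name `f1_at_k` is hypothetical, and the partial criterion used here (two phrases match if they share at least one word token) is an assumption chosen for illustration, since the abstract does not define partial matching.

```python
def f1_at_k(predicted, gold, k=6, partial=False):
    """F1@k between a ranked list of predicted keyphrases and gold keyphrases.

    Illustrative sketch only; the partial-match rule is an assumption.
    """
    top = [p.lower().strip() for p in predicted[:k]]
    ref = [g.lower().strip() for g in gold]
    if not top or not ref:
        return 0.0

    def match(a, b):
        if partial:
            # Assumed partial criterion: at least one shared word token.
            return bool(set(a.split()) & set(b.split()))
        # Exact criterion: identical strings after lowercasing.
        return a == b

    # Precision counts predictions that match some gold phrase;
    # recall counts gold phrases matched by some prediction.
    hit_pred = sum(any(match(p, g) for g in ref) for p in top)
    hit_gold = sum(any(match(p, g) for p in top) for g in ref)
    precision = hit_pred / len(top)
    recall = hit_gold / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A made-up example: the first prediction is an inflected (genitive)
# variant of the first gold keyphrase, so it fails exact matching.
gold = ["extrakcia kľúčových slov", "neurónové siete"]
predicted = ["extrakcie kľúčových slov", "neurónové siete", "model"]

exact_score = f1_at_k(predicted, gold, k=6, partial=False)
partial_score = f1_at_k(predicted, gold, k=6, partial=True)
print(f"exact F1@6 = {exact_score:.2f}, partial F1@6 = {partial_score:.2f}")
```

In this toy case a single inflectional ending is enough to turn a correct concept into an exact-match miss, which is the kind of morphological mismatch the abstract describes.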

