Multi-Disciplinary Dataset Discovery from Citation-Verified Literature Contexts

Identifying suitable datasets for a research question remains challenging because existing dataset search engines rely heavily on metadata quality and keyword overlap, which often fail to capture the semantic intent of a scientific investigation. We introduce a literature-driven framework that discovers datasets from citation contexts in scientific papers, enabling retrieval grounded in actual research use rather than metadata availability. Our approach combines large-scale citation-context extraction, schema-guided dataset recognition with Large Language Models, and provenance-preserving entity resolution. We evaluate the system on eight survey-derived computer science queries and find that it achieves substantially higher recall than Google Dataset Search and DataCite Commons, with normalized recall averaging 47.47% and reaching a maximum of 81.82%. Beyond recovering gold-standard datasets, the method also surfaces additional datasets not documented in the surveys. Expert assessments across five top-level Fields of Science indicate that a substantial portion of the additional datasets are considered to be of high utility, and some are regarded as novel for the specific topics chosen by the experts. These findings establish citation-context mining as an effective and generalizable paradigm for dataset discovery, particularly in settings where datasets lack sufficient or reliable metadata. To support reproducibility and future extensions, we release our code, evaluation datasets, and results on GitHub (https://github.com/Fireblossom/citation-context-dataset-discovery).
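
As a rough illustration of two ideas named in the abstract, provenance-preserving entity resolution and recall against a survey-derived gold list, the following minimal Python sketch may help. It is not the authors' implementation: the function names (`normalize_name`, `resolve_mentions`, `recall`), the string-normalization rule, and the example data are all hypothetical, and the metric shown is plain recall, since the abstract does not define how the reported recall is normalized.

```python
import re


def normalize_name(name: str) -> str:
    """Assumed matching rule: lowercase and drop non-alphanumeric characters."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def resolve_mentions(mentions):
    """Merge (mention, paper_id) pairs that share a normalized name.

    Each resolved entity keeps the surface forms and the citing papers
    it came from, so provenance is not lost when variants are merged.
    """
    resolved = {}
    for mention, paper_id in mentions:
        key = normalize_name(mention)
        entry = resolved.setdefault(key, {"names": set(), "papers": set()})
        entry["names"].add(mention)
        entry["papers"].add(paper_id)
    return resolved


def recall(resolved, gold):
    """Fraction of gold-standard datasets found among the resolved entities."""
    gold_keys = {normalize_name(g) for g in gold}
    if not gold_keys:
        return 0.0
    return len(gold_keys & set(resolved)) / len(gold_keys)


# Hypothetical mentions extracted from citation contexts.
mentions = [("ImageNet", "p1"), ("Image-Net", "p2"), ("MS COCO", "p3")]
resolved = resolve_mentions(mentions)
score = recall(resolved, ["ImageNet", "CIFAR-10"])
```

In this toy run the two spellings of ImageNet collapse into one entity that still records both citing papers, and one of the two gold datasets is recovered. A real system would presumably need a far more robust matching step than string normalization.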

