The Science Data Lake: A Unified Open Infrastructure Integrating 293 Million Papers Across Eight Scholarly Sources with Embedding-Based Ontology Alignment

Jonas Wilinski

From arXiv: 18 pages, 8 figures, 7 tables. Dataset DOI: 10.57967/hf/7850. Code: https://github.com/J0nasW/science-datalake

Scholarly data are fragmented across siloed databases with divergent metadata and missing linkages among them. We present the Science Data Lake, a locally deployable infrastructure built on DuckDB and Parquet files that unifies eight open sources (Semantic Scholar, OpenAlex, SciSciNet, Papers with Code, Retraction Watch, Reliance on Science, a preprint-to-published mapping, and Crossref) via DOI normalization while preserving source-level schemas. The resource comprises approximately 960 GB of Parquet files spanning ~293 million uniquely identifiable papers across ~22 schemas and ~153 SQL views. An embedding-based ontology alignment using BGE-large sentence embeddings maps 4,516 OpenAlex topics to 13 scientific ontologies (~1.3 million terms), yielding 16,150 mappings that cover 99.8% of topics at a similarity threshold of $\geq 0.65$, with $F_1 = 0.77$ at the recommended $\geq 0.85$ operating point, outperforming TF-IDF, BM25, and Jaro-Winkler baselines on a 300-pair gold-standard evaluation. We validate the resource through 10 automated checks, cross-source citation agreement analysis (pairwise Pearson $r$ between $0.76$ and $0.87$), and stratified manual annotation. Four vignettes demonstrate cross-source analyses that are infeasible with any single database. The resource is open source, deployable on a single drive or queryable remotely via HuggingFace, and includes structured documentation suitable for research agents based on large language models (LLMs).
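
To make the DOI-based unification concrete, the following is a minimal sketch of how records from two sources might be joined on a normalized DOI. The abstract does not spell out the project's normalization rules, so the prefix list, the lowercasing step, the validation pattern, and the names `normalize_doi` and `join_on_doi` are all assumptions made for illustration, not the repository's actual implementation.

```python
import re
from typing import Dict, List, Optional, Tuple

# Assumed set of resolver/scheme prefixes to strip; the real pipeline may differ.
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)

# Loose shape check: "10.", a 4-9 digit registrant code, "/", then a suffix.
_DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def normalize_doi(raw: Optional[str]) -> Optional[str]:
    """Return a lowercase bare DOI, or None if the input does not look like one."""
    if raw is None:
        return None
    doi = raw.strip().lower()  # DOIs are case-insensitive, so lowercase is a safe key
    for prefix in _DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    return doi if _DOI_PATTERN.match(doi) else None


def join_on_doi(
    left: List[Dict[str, str]], right: List[Dict[str, str]]
) -> List[Tuple[Dict[str, str], Dict[str, str]]]:
    """Inner-join two lists of records (each with a 'doi' field) on the normalized DOI."""
    index = {}
    for rec in right:
        key = normalize_doi(rec.get("doi"))
        if key is not None:
            index[key] = rec
    pairs = []
    for rec in left:
        key = normalize_doi(rec.get("doi"))
        if key is not None and key in index:
            pairs.append((rec, index[key]))
    return pairs


# Hypothetical records: the same paper written two different ways.
source_a = [{"doi": "https://doi.org/10.57967/HF/7850", "title": "Science Data Lake"}]
source_b = [{"doi": "doi:10.57967/hf/7850", "citations": "0"}]
matched = join_on_doi(source_a, source_b)
```

Joining on a normalized key rather than merging the tables is what allows each source to keep its own schema: the records stay separate and are only linked through the shared identifier.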

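
The threshold-based alignment described in the abstract can be illustrated with a small sketch. It assumes that topics and ontology terms have already been embedded and that similarity is measured by cosine similarity; the abstract names BGE-large as the embedding model but does not state the similarity measure, so cosine is an assumption here. The three-dimensional vectors, the labels, and the `align` function are hypothetical stand-ins, not real BGE-large outputs or the project's code.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(
    topics: Dict[str, Sequence[float]],
    terms: Dict[str, Sequence[float]],
    threshold: float,
) -> List[Tuple[str, str, float]]:
    """Return (topic, term, score) for every pair scoring at or above the threshold."""
    mappings = []
    for topic, topic_vec in topics.items():
        for term, term_vec in terms.items():
            score = cosine(topic_vec, term_vec)
            if score >= threshold:
                mappings.append((topic, term, score))
    # Highest-scoring mappings first.
    return sorted(mappings, key=lambda m: m[2], reverse=True)


# Toy embeddings, invented for illustration only.
topic_vecs = {"Deep learning": [0.9, 0.1, 0.0]}
term_vecs = {
    "neural network": [1.0, 0.0, 0.0],
    "learning theory": [0.7, 0.7, 0.0],
    "protein folding": [0.0, 0.0, 1.0],
}

broad = align(topic_vecs, term_vecs, threshold=0.65)   # coverage-oriented threshold
strict = align(topic_vecs, term_vecs, threshold=0.85)  # precision-oriented operating point
```

In this toy case the lower threshold keeps two candidate terms for the topic while the higher one keeps only the closest, which mirrors the trade-off between the 0.65 coverage threshold and the recommended 0.85 operating point. At the scale reported in the abstract (about 1.3 million terms), the nested loop would be replaced by a vectorized or approximate nearest-neighbour search.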