Scholarly data are fragmented across siloed databases with divergent metadata and missing linkages among them. We present the Science Data Lake, a locally deployable infrastructure built on DuckDB and simple Parquet files that unifies eight open sources (Semantic Scholar, OpenAlex, SciSciNet, Papers with Code, Retraction Watch, Reliance on Science, a preprint-to-published mapping, and Crossref) via DOI normalization while preserving source-level schemas. The resource comprises approximately 960 GB of Parquet files spanning ~293 million uniquely identifiable papers across ~22 schemas and ~153 SQL views. An embedding-based ontology alignment using BGE-large sentence embeddings maps 4,516 OpenAlex topics to 13 scientific ontologies (~1.3 million terms), yielding 16,150 mappings that cover 99.8% of topics at a $\geq 0.65$ threshold and reaching $F_1 = 0.77$ at the recommended $\geq 0.85$ operating point, outperforming TF-IDF, BM25, and Jaro-Winkler baselines on a 300-pair gold-standard evaluation. We validate the resource through 10 automated checks, cross-source citation agreement analysis (pairwise Pearson $r = 0.76$ to $0.87$), and stratified manual annotation. Four vignettes demonstrate cross-source analyses that are infeasible with any single database. The resource is open source, deployable on a single drive or queryable remotely via HuggingFace, and includes structured documentation suitable for large language model (LLM) based research agents.
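As an illustration of the DOI normalization step that links the sources, the following is a minimal sketch and not the resource's actual implementation. The function name `normalize_doi`, the list of stripped prefixes, and the validation pattern are all assumptions made for this example; the abstract does not specify the exact normalization rules.

```python
import re
from urllib.parse import unquote

# Hypothetical prefixes that different sources might attach to a DOI.
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)

# Loose shape check (an assumption): "10.", a 4-9 digit registrant code, "/", a suffix.
_DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def normalize_doi(raw):
    """Return a lowercase bare DOI, or None if the input does not look like one."""
    if raw is None:
        return None
    # DOIs are case-insensitive, so lowercasing gives a stable join key.
    doi = unquote(raw.strip()).lower()
    for prefix in _DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    doi = doi.strip()
    return doi if _DOI_PATTERN.match(doi) else None


# Two toy records for the same paper as they might appear in two sources.
source_a = {"doi": "https://doi.org/10.1000/ABC.123", "citations": 41}
source_b = {"doi": "doi:10.1000/abc.123", "citations": 39}

# After normalization both records share one key and can be joined.
same_paper = normalize_doi(source_a["doi"]) == normalize_doi(source_b["doi"])
```

Once each source exposes a normalized DOI column of this kind, source-level tables can keep their own schemas while still being joined on a common key.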