Lit2Vec: A Reproducible Workflow for Building a Legally Screened Chemistry Corpus from S2ORC for Downstream Retrieval and Text Mining

We present Lit2Vec, a reproducible workflow for constructing and validating a chemistry corpus from the Semantic Scholar Open Research Corpus using conservative, metadata-based license screening. Using this workflow, we assembled an internal study corpus of 582,683 chemistry-specific full-text research articles with structured full text, token-aware paragraph chunks, paragraph-level embeddings generated with the intfloat/e5-large-v2 model, and record-level metadata including abstracts and licensing information. To support downstream retrieval and text-mining use cases, an eligible subset of the corpus was additionally enriched with machine-generated brief summaries and multi-label subfield annotations spanning 18 chemistry domains. Licensing was screened using metadata from Unpaywall, OpenAlex, and Crossref, and the resulting corpus was technically validated for schema compliance, embedding reproducibility, text quality, and metadata completeness. The primary contribution of this work is a reproducible workflow for corpus construction and validation, together with its associated schema and reproducibility resources. The released materials include the code, reconstruction workflow, schema, metadata/provenance artifacts, and validation outputs needed to reproduce the corpus from pinned public upstream resources. Public redistribution of source-derived text and broad text-derived representations is outside the scope of the general release. Researchers can reproduce the workflow by using the released pipeline with publicly available upstream datasets and metadata services.
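As a rough illustration of what conservative, metadata-based license screening could look like, the sketch below combines license strings from the three metadata sources named above and keeps a record only when at least one source reports a license and every reported license is on an allowlist. The allowlist contents, the agreement rule, and the names `PERMISSIVE`, `normalize`, and `screen` are assumptions made for illustration. They are not taken from the Lit2Vec pipeline, and the actual screening criteria may differ.

```python
from typing import Mapping, Optional

# Hypothetical allowlist; the accepted licenses are an assumption of this sketch.
PERMISSIVE = frozenset({"cc-by", "cc0", "public-domain"})


def normalize(raw: Optional[str]) -> Optional[str]:
    """Map a license string or Creative Commons URL to a short key."""
    if not raw:
        return None
    key = raw.strip().lower()
    marker = "creativecommons.org/licenses/"
    if marker in key:
        # ".../licenses/by-nc/4.0/" becomes "cc-by-nc"
        key = "cc-" + key.split(marker, 1)[1].split("/", 1)[0]
    key = key.replace("_", "-").replace(" ", "-")
    # Drop version components such as "4.0" so "cc-by-4.0" matches "cc-by".
    parts = [p for p in key.split("-") if p and not p[0].isdigit()]
    return "-".join(parts) or None


def screen(licenses: Mapping[str, Optional[str]]) -> bool:
    """Return True only if some source reports a license and all reported
    licenses are on the allowlist. Missing or unparseable values are ignored,
    so a record with no usable license metadata is rejected."""
    reported = [normalize(value) for value in licenses.values()]
    reported = [r for r in reported if r is not None]
    return bool(reported) and all(r in PERMISSIVE for r in reported)


record = {
    "unpaywall": "cc-by",
    "openalex": "CC BY 4.0",
    "crossref": "http://creativecommons.org/licenses/by/4.0/",
}
keep = screen(record)
```

Under this rule a single non-permissive or conflicting value is enough to exclude a record, which trades recall for a lower risk of including restrictively licensed text. A stricter variant could require all three sources to report a license.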
