SciLaD is a novel, large-scale dataset of scientific language constructed entirely using open-source frameworks and publicly available data sources. It comprises a curated English split containing over 10 million scientific publications and a multilingual, unfiltered TEI XML split containing more than 35 million publications. We also publish the extensible pipeline used to generate SciLaD. The dataset construction and processing workflow demonstrates how open-source tools can enable large-scale scientific data curation while maintaining high data quality. Finally, we pre-train a RoBERTa model on our dataset and evaluate it across a comprehensive set of benchmarks. It achieves performance comparable to other scientific language models of similar size, which validates the quality and utility of SciLaD. We publish the dataset and evaluation pipeline to promote reproducibility, transparency, and further research in scientific natural language processing and understanding, including scholarly document processing.