We present a dataset of 833k paragraphs extracted from CC-BY-licensed scientific publications and classified into four categories: acknowledgments, data mentions, software/code mentions, and clinical trial mentions. The paragraphs are primarily in English and French, with other European languages also represented. Each paragraph is annotated with its language (identified using fastText) and its scientific domain (from OpenAlex). Derived from the French Open Science Monitor corpus and processed with GROBID, the dataset supports training text classification models and developing named entity recognition systems for scientific literature mining. It is publicly available on HuggingFace (https://doi.org/10.57967/hf/6679) under a CC-BY license.