Natural Language Processing (NLP) is widely used to distill long-form text into structured information. However, extracting structured knowledge from scientific text with NLP models remains challenging due to the domain-specific nature of the material, the complexity of data preprocessing, and the granularity of multi-layered device-level information. To address this, we introduce ByteScience, a non-profit, cloud-based platform for automatically fine-tuning Large Language Models (LLMs), designed to extract structured scientific data and synthesize new scientific knowledge from vast scientific corpora. The platform capitalizes on DARWIN, an open-source LLM fine-tuned for natural science. Built on Amazon Web Services (AWS), it provides an automated, user-friendly workflow for custom model development and data extraction, and achieves remarkable accuracy with only a small number of well-annotated articles. This innovative tool streamlines the transition from scientific literature to structured knowledge and data, benefiting advances in natural informatics.