With the exponential growth of online scientific literature, identifying reliable domain-specific data has become increasingly important and increasingly difficult. Manual collection and filtering of domain-specific scientific literature is time-consuming, labor-intensive, and prone to errors and inconsistencies. To automate data collection, this paper introduces a web-based tool that leverages Large Language Models (LLMs) for the automated and scalable development of open scientific databases. The tool is built on a unified framework that combines keyword-based querying, API-enabled data retrieval, and LLM-powered text classification to construct domain-specific scientific databases. Data is collected from multiple reliable sources and search engines using parallel querying and merged into a single unified dataset. The dataset is then filtered by LLMs, queried with prompts tailored to each keyword-based query, to extract the data relevant to the scientific query of interest. The approach was tested on a range of keyword-based searches for domain-specific tasks related to agriculture and crop yield. The results show a 90% overlap with small databases curated by domain experts, suggesting that the tool can significantly reduce manual workload. Furthermore, the framework is scalable and domain-agnostic, and can be applied across diverse fields to build open scientific databases.
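The three stages described in the abstract (parallel querying of several sources, merging into one dataset, and LLM-based relevance filtering with a query-specific prompt) can be illustrated with a minimal sketch. This is not the authors' implementation: the two sources are hypothetical stand-ins returning canned records, `stub_llm` is a keyword-matching placeholder for a real LLM call, de-duplication by normalized title is an assumption, and `overlap` is only one plausible way to compute agreement with an expert-curated list.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Record = Dict[str, str]
Source = Callable[[str], List[Record]]


def source_a(query: str) -> List[Record]:
    # Hypothetical stand-in for one literature API; returns canned records.
    return [
        {"title": "Wheat yield under drought stress", "abstract": "Field trial of wheat yield."},
        {"title": "Deep learning for image captioning", "abstract": "A vision-language model."},
    ]


def source_b(query: str) -> List[Record]:
    # Hypothetical second source; one title duplicates source_a up to case and spacing.
    return [
        {"title": "Wheat Yield Under Drought Stress ", "abstract": "Field trial of wheat yield."},
        {"title": "Maize yield response to nitrogen", "abstract": "Crop yield and fertilizer."},
    ]


def normalize(title: str) -> str:
    # Lowercase and collapse whitespace so near-identical titles compare equal.
    return " ".join(title.lower().split())


def collect(query: str, sources: List[Source]) -> List[Record]:
    """Query all sources in parallel and merge the results, de-duplicating by title."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        batches = list(pool.map(lambda src: src(query), sources))
    merged: Dict[str, Record] = {}
    for batch in batches:
        for rec in batch:
            merged.setdefault(normalize(rec["title"]), rec)  # keep first occurrence
    return list(merged.values())


def build_prompt(query: str, rec: Record) -> str:
    # The prompt is tailored to the keyword query that produced the record.
    return (
        f"Query: {query}\n"
        f"Title: {rec['title']}\nAbstract: {rec['abstract']}\n"
        "Is this paper relevant to the query? Answer yes or no."
    )


def stub_llm(prompt: str) -> str:
    # Placeholder for a real LLM call (assumption): says "yes" when the
    # document part of the prompt mentions "yield".
    doc = prompt.split("Title:", 1)[1]
    return "yes" if "yield" in doc.lower() else "no"


def filter_relevant(query: str, records: List[Record], llm: Callable[[str], str]) -> List[Record]:
    """Keep only the records the classifier labels as relevant."""
    return [r for r in records if llm(build_prompt(query, r)).strip().lower().startswith("yes")]


def overlap(found: List[Record], curated_titles: List[str]) -> float:
    """Fraction of curated titles that also appear in the filtered dataset."""
    if not curated_titles:
        return 0.0
    found_keys = {normalize(r["title"]) for r in found}
    hits = sum(1 for t in curated_titles if normalize(t) in found_keys)
    return hits / len(curated_titles)


if __name__ == "__main__":
    records = collect("crop yield", [source_a, source_b])
    relevant = filter_relevant("crop yield", records, stub_llm)
    for rec in relevant:
        print(rec["title"])
```

In a real deployment, each source function would wrap an actual literature API and `stub_llm` would be replaced by a call to the chosen model. Passing the classifier in as a plain callable keeps the rest of the pipeline unchanged when that swap is made.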