The advent of natural language processing and large language models (LLMs) has revolutionized the extraction of data from unstructured scholarly papers. However, ensuring data trustworthiness remains a significant challenge. In this paper, we introduce PropertyExtractor, an open-source tool that leverages advanced conversational LLMs such as Google Gemini-Pro and OpenAI GPT-4. It blends zero-shot with few-shot in-context learning and uses engineered prompts to dynamically refine structured information hierarchies, enabling autonomous, efficient, scalable, and accurate identification, extraction, and verification of material property data. Our tests on material data demonstrate precision and recall exceeding 93% with an error rate of approximately 10%, highlighting the effectiveness and versatility of the toolkit. We apply PropertyExtractor to generate a database of 2D material thicknesses, a critical parameter for device integration. The rapid evolution of the 2D materials field has outpaced both experimental measurements and computational methods, creating a significant data gap. Our work addresses this gap and showcases the potential of PropertyExtractor as a reliable and efficient tool for the autonomous generation of diverse material property databases.
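
To make the idea of blending zero-shot with few-shot in-context learning concrete, the following is a minimal sketch of how an extraction prompt might be assembled. It is not PropertyExtractor's actual implementation: the names `Example` and `build_prompt`, the prompt wording, and the `material | value | unit` output format are all assumptions chosen for illustration, and the worked example uses an invented material and value. With an empty example list the prompt is zero-shot; supplying examples turns the same prompt into a few-shot one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """A hypothetical worked example (passage plus expected answer) for few-shot prompting."""
    passage: str
    extraction: str


def build_prompt(property_name: str, passage: str, examples: List[Example]) -> str:
    """Assemble an extraction prompt.

    The instruction lines alone form a zero-shot prompt; any worked
    examples are appended after them to make the prompt few-shot.
    The wording and output format are assumptions for illustration.
    """
    lines = [
        f"Extract every reported value of '{property_name}' from the passage.",
        "Return one line per value as: material | value | unit.",
        "If the passage reports no value, return: none.",
    ]
    # Append each worked example, separated from the previous block by a blank line.
    for i, ex in enumerate(examples, start=1):
        lines.append("")
        lines.append(f"Example {i} passage: {ex.passage}")
        lines.append(f"Example {i} answer: {ex.extraction}")
    # The passage to be processed always comes last, followed by the answer cue.
    lines.append("")
    lines.append(f"Passage: {passage}")
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Invented example: "material A" and its value are placeholders, not real data.
    shots = [Example("Material A is 1.0 nm thick.", "material A | 1.0 | nm")]
    print(build_prompt("thickness", "Text of the paper section goes here.", shots))
```

The resulting string would then be sent to whichever conversational LLM is in use; the API call is deliberately left out here because it depends on the provider and client library.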