This research explores the integration of large language models (LLMs) into scientific data assimilation, using combustion science as a case study. Leveraging foundational models integrated with a Retrieval-Augmented Generation (RAG) framework, the study introduces an approach for processing diverse combustion research data spanning experimental studies, simulations, and literature. The multifaceted nature of combustion research underscores the critical role of knowledge processing in navigating and extracting valuable information from a vast and diverse pool of sources. The developed approach minimizes computational and economic costs while optimizing data privacy and accuracy. It combines prompt engineering with offline open-source LLMs, giving users autonomy in selecting base models. The study examines text segmentation strategies in detail, compares LLMs against one another, and explores various optimized prompts to demonstrate the effectiveness of the framework. By incorporating an external database, the framework outperforms a conventional LLM in generating accurate responses and constructing robust arguments. The study also investigates optimized prompt templates for efficient extraction of scientific literature. It addresses concerns about hallucinations and false research articles by introducing a custom workflow with a detection algorithm that filters out inaccuracies. Despite identified areas for improvement, the framework consistently delivers accurate domain-specific responses with minimal human oversight. The prompt-agnostic approach it introduces holds promise for future research. The study underscores the significance of integrating LLMs and knowledge processing techniques in scientific research, providing a foundation for advancements in data assimilation and utilization.