A Quick, Trustworthy Spectral Knowledge Q&A System Leveraging Retrieval-Augmented Generation on LLM

Large language models (LLMs) have demonstrated significant success in a range of natural language processing (NLP) tasks in the general domain. Their emergence has introduced new methodologies across diverse fields, including the natural sciences, where researchers aim to replace conventional manual, repetitive, and labor-intensive work with automated, concurrent processes driven by LLMs. In spectral analysis and detection, researchers must independently acquire relevant knowledge about a variety of research objects, including the spectroscopic techniques and chemometric methods used in experiments and analysis. Paradoxically, although spectroscopic detection is recognized as an effective analytical method, the underlying process of knowledge retrieval remains time-consuming and repetitive. To address this challenge, we first introduce the Spectral Detection and Analysis Based Paper (SDAAP) dataset, the first open-source textual knowledge dataset for spectral analysis and detection, which contains annotated literature data together with corresponding knowledge instruction data. We then design an automated Q&A framework based on the SDAAP dataset, which extracts entities from the input as retrieval parameters, retrieves the relevant knowledge, and generates high-quality responses. Notably, within this framework the LLM serves only as a tool that provides generalizability, while the retrieval-augmented generation (RAG) technique is used to accurately capture the source of the knowledge. This approach not only improves the quality of the generated responses but also ensures the traceability of the knowledge. Experimental results show that our framework generates responses with more reliable expertise than the baseline.
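The retrieval flow described above (extract entities from the question, use them as retrieval parameters, and pass the matched sources to the LLM with their identifiers) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the framework's actual implementation: the `Record` schema, the toy `CORPUS`, the function names, and the keyword-matching entity extractor (which stands in for the LLM-based extraction) are all hypothetical, and the corpus entries are invented placeholders rather than SDAAP data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One annotated literature entry (hypothetical schema)."""
    source_id: str
    research_object: str
    technique: str
    method: str
    summary: str


# Toy stand-in corpus; entries are invented placeholders, not SDAAP data.
CORPUS = [
    Record("paper-001", "milk", "near-infrared spectroscopy", "PLS", "placeholder summary A"),
    Record("paper-002", "soil", "raman spectroscopy", "SVM", "placeholder summary B"),
    Record("paper-003", "milk", "raman spectroscopy", "PCA", "placeholder summary C"),
]


def record_terms(r: Record) -> set[str]:
    """Lower-cased entity terms attached to a record."""
    return {r.research_object, r.technique, r.method.lower()}


def extract_entities(question: str) -> set[str]:
    """Assumption: naive substring matching stands in for LLM entity extraction."""
    text = question.lower()
    vocab = set().union(*(record_terms(r) for r in CORPUS))
    return {term for term in vocab if term in text}


def retrieve(entities: set[str], top_k: int = 2) -> list[Record]:
    """Rank records by the number of extracted entities they match."""
    scored = [(len(record_terms(r) & entities), r) for r in CORPUS]
    scored = [pair for pair in scored if pair[0] > 0]
    # Sort on the score only; the sort is stable, so ties keep corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored[:top_k]]


def build_prompt(question: str, records: list[Record]) -> str:
    """Attach source IDs to the context so the answer stays traceable."""
    context = "\n".join(
        f"[{r.source_id}] {r.research_object}; {r.technique}; {r.method}; {r.summary}"
        for r in records
    )
    return (
        "Answer using only the sources below and cite their IDs.\n"
        f"{context}\nQuestion: {question}"
    )


if __name__ == "__main__":
    q = "Which chemometric methods are used with raman spectroscopy on milk?"
    print(build_prompt(q, retrieve(extract_entities(q))))
```

In this sketch the language model would only see the retrieved records, each tagged with its source identifier, which is one simple way to keep generated answers traceable to the literature they draw on.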
