Scientific reasoning increasingly requires linking structured experimental data with the unstructured literature that explains it, yet most large language model (LLM) assistants cannot reason jointly across these modalities. We introduce SpectraQuery, a hybrid natural-language query framework that integrates a relational Raman spectroscopy database with a vector-indexed scientific literature corpus, using a design inspired by the Structured and Unstructured Query Language (SUQL). By combining semantic parsing with retrieval-augmented generation, SpectraQuery translates open-ended questions into coordinated SQL and literature retrieval operations, producing cited answers that unify numerical evidence with mechanistic explanation. SpectraQuery performs well on four evaluations (SQL correctness, answer groundedness, retrieval effectiveness, and expert review): approximately 80 percent of generated SQL queries are fully correct, synthesized answers reach 93-97 percent groundedness when 10-15 passages are retrieved, and battery scientists rate responses 4.1-4.6 out of 5 on accuracy, relevance, grounding, and clarity. These results show that hybrid retrieval architectures can meaningfully support scientific workflows by bridging data and discourse for high-volume experimental datasets.
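To make the hybrid flow described in the abstract concrete, the following is a minimal sketch of one question being answered from both a relational table and a passage collection. It is not SpectraQuery's implementation: the `peaks` schema, the rows, the passages, and the function names are all hypothetical placeholders; the SQL is hard-coded where the real system would rely on semantic parsing by an LLM; and a bag-of-words cosine score stands in for a learned vector index.

```python
import math
import re
import sqlite3
from collections import Counter

# Hypothetical passages standing in for a vector-indexed literature corpus.
PASSAGES = {
    "doc1": "placeholder passage about the D band and disorder in carbon",
    "doc2": "placeholder passage about electrolyte viscosity",
}


def build_db():
    """Create an in-memory table with invented, illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE peaks ("
        "sample_id TEXT, cycle INTEGER, shift_cm1 REAL, intensity REAL)"
    )
    conn.executemany(
        "INSERT INTO peaks VALUES (?, ?, ?, ?)",
        [
            ("A", 1, 1350.0, 0.40),
            ("A", 1, 1580.0, 1.00),
            ("A", 50, 1350.0, 0.70),
            ("A", 50, 1580.0, 1.00),
        ],
    )
    return conn


def tokenize(text):
    """Lowercase and split text into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, k=1):
    """Return the ids of the k passages most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        PASSAGES,
        key=lambda doc_id: cosine(q, Counter(tokenize(PASSAGES[doc_id]))),
        reverse=True,
    )
    return ranked[:k]


def answer(conn, question):
    """Run the structured and unstructured halves and bundle the evidence."""
    # Assumption: a real system would generate this SQL from the question.
    sql = (
        "SELECT cycle, MAX(intensity) FROM peaks "
        "WHERE shift_cm1 BETWEEN 1300 AND 1400 "
        "GROUP BY cycle ORDER BY cycle"
    )
    rows = conn.execute(sql).fetchall()
    return {"sql": sql, "rows": rows, "citations": retrieve(question)}


if __name__ == "__main__":
    result = answer(build_db(), "How does the D band change with cycle?")
    print(result["rows"], result["citations"])
```

The returned dictionary keeps the SQL, the numeric rows, and the passage ids side by side, so that a downstream generation step could cite each claim against either the database result or a specific passage.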