Retrieval Augmented Generation of Literature-derived Polymer Knowledge: The Example of a Biodegradable Polymer Expert System

Polymer literature contains a large and growing body of experimental knowledge, yet much of it is buried in unstructured text and inconsistent terminology, making systematic retrieval and reasoning difficult. Existing tools typically extract narrow, study-specific facts in isolation, failing to preserve the cross-study context required to answer broader scientific questions. Retrieval-augmented generation (RAG) offers a promising way to overcome this limitation by combining large language models (LLMs) with external retrieval, but its effectiveness depends strongly on how domain knowledge is represented. In this work, we develop two retrieval pipelines: a dense semantic vector-based approach (VectorRAG) and a graph-based approach (GraphRAG). Using over 1,000 polyhydroxyalkanoate (PHA) papers, we construct context-preserving paragraph embeddings and a canonicalized structured knowledge graph supporting entity disambiguation and multi-hop reasoning. We evaluate these pipelines through standard retrieval metrics, comparisons with general state-of-the-art systems such as GPT and Gemini, and qualitative validation by a domain chemist. The results show that GraphRAG achieves higher precision and interpretability, while VectorRAG provides broader recall, highlighting complementary trade-offs. Expert validation further confirms that the tailored pipelines, particularly GraphRAG, produce well-grounded, citation-reliable responses with strong domain relevance. By grounding every statement in evidence, these systems enable researchers to navigate the literature, compare findings across studies, and uncover patterns that are difficult to extract manually. More broadly, this work establishes a practical framework for building materials science assistants using curated corpora and retrieval design, reducing reliance on proprietary models while enabling trustworthy literature analysis at scale.
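To make the contrast between the two pipelines concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical toy corpus with hand-made three-dimensional "embeddings" and a hypothetical three-triple knowledge graph; the names `PARAGRAPHS`, `TRIPLES`, `vector_retrieve`, `graph_retrieve`, and `precision_recall` are illustrative only. The vector-style retriever ranks paragraphs by cosine similarity, the graph-style retriever follows entity links for a bounded number of hops, and the last function computes the standard precision and recall of a retrieved set.

```python
import math
from collections import deque

# Hypothetical toy corpus: paragraph id -> hand-made 3-d "embedding".
# A real system would use a trained embedding model; these values are illustrative.
PARAGRAPHS = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.8, 0.2, 0.1],
    "p3": [0.1, 0.9, 0.0],
    "p4": [0.0, 0.2, 0.9],
}

# Hypothetical toy knowledge graph: (subject, relation, object, source paragraph).
TRIPLES = [
    ("PHB", "is_a", "PHA", "p1"),
    ("PHB", "has_property", "melting_temperature", "p2"),
    ("melting_temperature", "measured_by", "DSC", "p3"),
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_retrieve(query_vec, k):
    """Vector-style retrieval: the k paragraphs most similar to the query."""
    ranked = sorted(
        PARAGRAPHS,
        key=lambda pid: cosine(query_vec, PARAGRAPHS[pid]),
        reverse=True,
    )
    return ranked[:k]


def graph_retrieve(start_entity, max_hops):
    """Graph-style retrieval: breadth-first walk over outgoing triples,
    collecting the source paragraph of every triple reached within max_hops."""
    seen = {start_entity}
    frontier = deque([(start_entity, 0)])
    sources = []
    while frontier:
        entity, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for subj, _rel, obj, src in TRIPLES:
            if subj != entity:
                continue
            if src not in sources:
                sources.append(src)
            if obj not in seen:
                seen.add(obj)
                frontier.append((obj, depth + 1))
    return sources


def precision_recall(retrieved, relevant):
    """Set-based precision and recall of a retrieved list."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For instance, `graph_retrieve("PHB", max_hops=2)` follows the link from `PHB` to `melting_temperature` and then to `DSC`, which a single-hop lookup would miss. The toy data say nothing about the relative performance reported in the paper; they only show the mechanics of each retrieval style and of the metrics.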
