GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval

Graph-based Retrieval-Augmented Generation (GraphRAG) extends retrieval-augmented generation to support structured reasoning over complex corpora, but its reliability in resource-constrained, privacy-sensitive deployments remains unclear. In healthcare, where Electronic Health Record (EHR) data is complex and strictly regulated, reliance on cloud-based large language models (LLMs) introduces challenges in cost, latency, and compliance. In this work, we present a systematic evaluation of GraphRAG for EHR schema retrieval using locally deployed open-source LLMs. We implement the Microsoft GraphRAG pipeline on real-world EHR schema documentation and benchmark four models, Llama 3.1 (8B), Mistral (7B), Qwen 2.5 (7B), and Phi-4-mini (3.8B), each deployed via Ollama on a single consumer GPU (8 GB VRAM). We evaluate indexing efficiency, knowledge graph construction, query latency, answer quality, and hallucination under both global and local retrieval modes. Our results reveal substantial differences across models: Llama 3.1 produces the richest knowledge graph (1,172 entities), Qwen 2.5 achieves the best answer quality (3.3/5), Phi-4-mini fails to complete the pipeline due to structured-output errors, and Mistral exhibits degenerate repetition. We further show that GraphRAG exhibits a practical capacity threshold: models below approximately 7B parameters fail to reliably produce valid structured outputs and cannot complete the pipeline. In addition, indexing performance and answer quality are decoupled across models (the model that builds the richest graph is not the one that answers best), and local retrieval consistently outperforms global summarization in both latency and factual grounding, with less hallucination. These findings demonstrate that GraphRAG is feasible on consumer hardware while highlighting the importance of model selection and retrieval design for robust deployment in regulated settings.
