Electronic Patient Record (EPR) systems contain valuable clinical information, but much of it is trapped in unstructured text, which limits its use for research and decision-making. Large language models can extract such information but require substantial computational resources to run locally, and sending sensitive clinical data to cloud-based services, even when deidentified, raises significant patient privacy concerns. In this study, we develop a resource-efficient, semi-automated annotation workflow that uses small language models (SLMs) to extract structured information from unstructured EPR data, focusing on paediatric histopathology reports. As a proof of concept, we apply the workflow to paediatric renal biopsy reports, a domain chosen for its constrained diagnostic scope and well-defined underlying biology. We develop the workflow iteratively with clinical oversight across three meetings: we manually annotate 400 reports, drawn from a dataset of 2,111 reports at Great Ormond Street Hospital, as a gold standard, and in parallel develop an automated information extraction approach using SLMs. We frame extraction as a Question-Answering task grounded in clinician-guided entity guidelines and few-shot examples, and we evaluate five instruction-tuned SLMs, using a disagreement modelling framework to prioritise reports for clinical review. Gemma 2 2B achieves the highest accuracy at 84.3%, outperforming off-the-shelf models including spaCy (74.3%), BioBERT-SQuAD (62.3%), RoBERTa-SQuAD (59.7%), and GLiNER (60.2%). Entity guidelines improve performance by 7–19% over the zero-shot baseline, and few-shot examples by 6–38%, though their benefits do not compound when the two are combined. These results demonstrate that SLMs can extract structured information from specialised clinical domains on CPU-only infrastructure with minimal clinician involvement. Our code is available at https://github.com/gosh-dre/nlp_renal_biopsy.
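The abstract does not specify how disagreement between models is computed. As an illustration only, the following minimal sketch assumes that each SLM returns one answer string per entity and that a report's disagreement is the average, over entities, of one minus the share of models giving the majority answer. The function names, model names, and the entity name `n_glomeruli` are hypothetical and are not taken from the released code.

```python
from collections import Counter


def disagreement_score(answers_by_model):
    """Average per-entity disagreement for one report.

    answers_by_model maps a (hypothetical) model name to a dict of
    entity name -> answer string. The score is 0.0 when all models
    agree on every entity and approaches 1.0 as answers diverge.
    """
    models = list(answers_by_model)
    entities = sorted({e for m in models for e in answers_by_model[m]})
    if not models or not entities:
        return 0.0
    total = 0.0
    for entity in entities:
        # Normalise answers lightly; a missing answer counts as "".
        votes = Counter(
            str(answers_by_model[m].get(entity, "")).strip().lower()
            for m in models
        )
        majority_share = votes.most_common(1)[0][1] / len(models)
        total += 1.0 - majority_share
    return total / len(entities)


def prioritise_reports(predictions):
    """Return report ids ordered from most to least disputed.

    predictions maps report id -> answers_by_model (see above).
    Ties are broken by report id so the ordering is deterministic.
    """
    scored = [(disagreement_score(p), rid) for rid, p in predictions.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [rid for _, rid in scored]
```

Under these assumptions, reports at the top of the returned list would be the first candidates for clinical review, while reports on which all models agree fall to the bottom.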