A Semi-Automated Annotation Workflow for Paediatric Histopathology Reports Using Small Language Models

Avish Vijayaraghavan, Jaskaran Singh Kawatra, Sebin Sabu, Jonny Sheldon, Will Poulett, Alex Eze, Daniel Key, John Booth, Shiren Patel, Jonny Pearson, Dan Schofield, Jonathan Hope, Pavithra Rajendran, Neil Sebire

From arXiv; 36 pages, includes supplementary information.

Electronic Patient Record (EPR) systems contain valuable clinical information, but much of it is trapped in unstructured text, limiting its use for research and decision-making. Large language models can extract such information but require substantial computational resources to run locally, and sending sensitive clinical data to cloud-based services, even when deidentified, raises significant patient privacy concerns. In this study, we develop a resource-efficient semi-automated annotation workflow using small language models (SLMs) to extract structured information from unstructured EPR data, focusing on paediatric histopathology reports. As a proof-of-concept, we apply the workflow to paediatric renal biopsy reports, a domain chosen for its constrained diagnostic scope and well-defined underlying biology. We develop the workflow iteratively with clinical oversight across three meetings, manually annotating 400 reports from a dataset of 2,111 at Great Ormond Street Hospital as a gold standard, while developing an automated information extraction approach using SLMs. We frame extraction as a Question-Answering task grounded by clinician-guided entity guidelines and few-shot examples, evaluating five instruction-tuned SLMs with a disagreement modelling framework to prioritise reports for clinical review. Gemma 2 2B achieves the highest accuracy at 84.3%, outperforming off-the-shelf models including spaCy (74.3%), BioBERT-SQuAD (62.3%), RoBERTa-SQuAD (59.7%), and GLiNER (60.2%). Entity guidelines improve performance by 7-19% over the zero-shot baseline, and few-shot examples by 6-38%, though their benefits do not compound when the two are combined. These results demonstrate that SLMs can extract structured information from specialised clinical domains on CPU-only infrastructure with minimal clinician involvement. Our code is available at https://github.com/gosh-dre/nlp_renal_biopsy.

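The abstract mentions a disagreement modelling framework used to prioritise reports for clinical review, but does not say how disagreement is computed. As a rough illustration only, the sketch below assumes one simple definition: for each entity, disagreement is one minus the share of models that give the most common answer, averaged over entities. The function names, model names, entity names, and answer values are all hypothetical and are not taken from the paper or its repository.

```python
from collections import Counter


def disagreement_score(answers_by_model: dict[str, dict[str, str]]) -> float:
    """Average per-entity disagreement across models for one report.

    ASSUMPTION: disagreement for an entity is 1 - (share of models that
    give the most common answer). This is an illustrative definition,
    not the paper's framework.
    """
    entities = sorted({e for answers in answers_by_model.values() for e in answers})
    if not entities:
        return 0.0
    n_models = len(answers_by_model)
    total = 0.0
    for entity in entities:
        # A model that gives no answer for an entity votes for the empty string.
        votes = Counter(
            answers.get(entity, "").strip().lower()
            for answers in answers_by_model.values()
        )
        top_share = votes.most_common(1)[0][1] / n_models
        total += 1.0 - top_share
    return total / len(entities)


def rank_for_review(reports: dict[str, dict[str, dict[str, str]]]) -> list[str]:
    """Return report IDs ordered from most to least model disagreement."""
    return sorted(reports, key=lambda rid: disagreement_score(reports[rid]), reverse=True)


# Hypothetical answers from three models for two reports.
example_reports = {
    "report_1": {
        "model_a": {"n_glomeruli": "12", "diagnosis": "x"},
        "model_b": {"n_glomeruli": "12", "diagnosis": "x"},
        "model_c": {"n_glomeruli": "12", "diagnosis": "x"},
    },
    "report_2": {
        "model_a": {"n_glomeruli": "12", "diagnosis": "a"},
        "model_b": {"n_glomeruli": "10", "diagnosis": "a"},
        "model_c": {"n_glomeruli": "8", "diagnosis": "b"},
    },
}

review_order = rank_for_review(example_reports)
```

Under this assumed definition, a report on which all models agree scores 0 and falls to the bottom of the review queue, while reports with split answers rise to the top. Exact string matching after lowercasing is a deliberate simplification; free-text entities would likely need a softer comparison.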