A Semi-Automated Annotation Workflow for Paediatric Histopathology Reports Using Small Language Models

Avish Vijayaraghavan, Jaskaran Singh Kawatra, Sebin Sabu, Jonny Sheldon, Will Poulett, Alex Eze, Daniel Key, John Booth, Shiren Patel, Jonny Pearson, Dan Schofield, Jonathan Hope, Pavithra Rajendran, Neil Sebire

From arXiv, 36 pages, includes supplementary information

Electronic Patient Record (EPR) systems contain valuable clinical information, but much of it is trapped in unstructured text, limiting its use for research and decision-making. Large language models can extract such information but require substantial computational resources to run locally, and sending sensitive clinical data to cloud-based services, even when deidentified, raises significant patient privacy concerns. In this study, we develop a resource-efficient semi-automated annotation workflow using small language models (SLMs) to extract structured information from unstructured EPR data, focusing on paediatric histopathology reports. As a proof-of-concept, we apply the workflow to paediatric renal biopsy reports, a domain chosen for its constrained diagnostic scope and well-defined underlying biology. We develop the workflow iteratively with clinical oversight across three meetings, manually annotating 400 reports from a dataset of 2,111 at Great Ormond Street Hospital as a gold standard, while developing an automated information extraction approach using SLMs. We frame extraction as a Question-Answering task grounded by clinician-guided entity guidelines and few-shot examples, evaluating five instruction-tuned SLMs with a disagreement modelling framework to prioritise reports for clinical review. Gemma 2 2B achieves the highest accuracy at 84.3%, outperforming off-the-shelf models including spaCy (74.3%), BioBERT-SQuAD (62.3%), RoBERTa-SQuAD (59.7%), and GLiNER (60.2%). Entity guidelines improved performance by 7-19% over the zero-shot baseline, and few-shot examples by 6-38%, though their benefits do not compound when combined. These results demonstrate that SLMs can extract structured information from specialised clinical domains on CPU-only infrastructure with minimal clinician involvement. Our code is available at https://github.com/gosh-dre/nlp_renal_biopsy.
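
The two ideas summarised above, framing extraction as Question-Answering with optional entity guidelines and few-shot examples, and using disagreement between models to prioritise reports for clinical review, can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that): the function names, the prompt layout, the majority-vote disagreement score, and the entity name `n_glomeruli` are all assumptions made for illustration.

```python
from collections import Counter


def build_prompt(report, question, guideline=None, examples=()):
    """Assemble a QA-style prompt (hypothetical layout, not the paper's template).

    guideline: optional clinician-written text describing the entity.
    examples:  optional (report_text, answer) pairs used as few-shot examples.
    """
    parts = []
    if guideline:
        parts.append(f"Guideline: {guideline}")
    for ex_report, ex_answer in examples:
        parts.append(f"Report: {ex_report}\nQuestion: {question}\nAnswer: {ex_answer}")
    # The final block is left open for the model to complete.
    parts.append(f"Report: {report}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)


def disagreement(answers):
    """Fraction of model answers that differ from the most common answer.

    An assumed scoring rule: 0.0 means all models agree.
    """
    normalised = [a.strip().lower() for a in answers]
    majority_count = Counter(normalised).most_common(1)[0][1]
    return 1 - majority_count / len(normalised)


def rank_for_review(predictions):
    """Order report ids from most to least contested.

    predictions: {report_id: {entity_name: [answer from each model]}}
    """
    scores = {
        report_id: sum(disagreement(a) for a in entities.values()) / len(entities)
        for report_id, entities in predictions.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy usage with made-up answers from three hypothetical models.
predictions = {
    "r1": {"n_glomeruli": ["12", "12", "12"]},
    "r2": {"n_glomeruli": ["12", "10", "8"]},
}
review_order = rank_for_review(predictions)
```

In this sketch, reports on which the models give conflicting answers rise to the top of `review_order`, so clinician time would go first to the cases the models find hardest.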
