Pathology reports are rich in clinical and pathological detail but are often written as free text. Their unstructured nature is a significant obstacle to accessing their content. In this work, we present a practical approach that uses large multimodal models (LMMs) to automatically extract information from scanned images of pathology reports, with the goal of generating a standardised report that specifies the value of each field along with an estimated confidence in the accuracy of that value. The proposed approach overcomes a limitation of existing methods, which do not assign confidence scores to extracted fields and are therefore of limited practical use. The proposed framework uses two stages of prompting an LMM for information extraction and validation. The framework generalises to textual reports from multiple medical centres as well as scanned images of legacy pathology reports. We show that the estimated confidence is an effective indicator of the accuracy of the extracted information and can be used to select only accurately extracted fields. We also show the prognostic significance of structured and unstructured data from pathology reports, and show that the automatically extracted field values have significant prognostic value for patient stratification. The framework is available for evaluation via the URL: https://labieb.dcs.warwick.ac.uk/.
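The two-stage idea (extract, then validate with a confidence score, then keep only confident fields) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how prompts are worded or how confidence is estimated. The sketch assumes the model is asked to self-report a score between 0 and 1, uses report text rather than a scanned image for simplicity, and treats `model` as a hypothetical callable that maps a prompt string to a reply string. All function names, prompts, and the 0.8 threshold are illustrative assumptions.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical model interface (an assumption): prompt text in, reply text out.
Model = Callable[[str], str]


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # clamped to [0, 1]


def extract_fields(report_text: str, field_names: List[str], model: Model) -> Dict[str, str]:
    """Stage 1: ask the model for the value of each field."""
    values = {}
    for name in field_names:
        prompt = (
            f"Report:\n{report_text}\n\n"
            f"Extract the value of '{name}'. Answer with the value only."
        )
        values[name] = model(prompt).strip()
    return values


def validate_fields(report_text: str, values: Dict[str, str], model: Model) -> List[ExtractedField]:
    """Stage 2: ask the model how confident it is in each extracted value."""
    results = []
    for name, value in values.items():
        prompt = (
            f"Report:\n{report_text}\n\n"
            f"Proposed {name}: {value}\n"
            "On a scale from 0 to 1, how confident are you that this is "
            "correct? Answer with a number only."
        )
        try:
            confidence = float(model(prompt).strip())
        except ValueError:
            confidence = 0.0  # an unparseable reply is treated as no confidence
        if math.isnan(confidence):
            confidence = 0.0
        confidence = min(max(confidence, 0.0), 1.0)
        results.append(ExtractedField(name, value, confidence))
    return results


def select_confident(fields: List[ExtractedField], threshold: float = 0.8) -> List[ExtractedField]:
    """Keep only fields whose confidence meets the (assumed) threshold."""
    return [f for f in fields if f.confidence >= threshold]
```

In practice the threshold would be chosen on held-out reports, trading the fraction of fields retained against the accuracy of those retained.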