In this study, we evaluate a locally deployed large language model (LLM) for converting unstructured endometriosis transvaginal ultrasound (eTVUS) scan reports into structured data for imaging informatics workflows. Across 49 eTVUS reports, we compared three LLMs (7B/8B-parameter models and a 20B-parameter model) against expert human extraction. The 20B model achieved a mean accuracy of 86.02%, substantially outperforming the smaller models and underscoring the importance of scale in handling complex clinical text. Crucially, we identified complementary error profiles: the LLM excelled at syntactic consistency (e.g., date and numeric formatting), where humans faltered, while human experts provided superior semantic and contextual interpretation. We also found that the LLM's semantic errors reflected fundamental limitations that simple prompt engineering could not mitigate. These findings strongly support a human-in-the-loop (HITL) workflow in which the on-premises LLM serves as a collaborative tool rather than a full replacement: it automates routine structuring and flags potential human errors, enabling imaging specialists to focus on high-level semantic validation. We discuss implications for structured reporting and interactive AI systems in clinical practice.