In this study, we evaluate locally deployed large language models (LLMs) for converting unstructured endometriosis transvaginal ultrasound (eTVUS) scan reports into structured data for imaging informatics workflows. Across 49 eTVUS reports, we compared three LLMs (two in the 7B/8B-parameter range and one 20B-parameter model) against expert human extraction. The 20B model achieved a mean accuracy of 86.02%, substantially outperforming the smaller models and underscoring the importance of scale in handling complex clinical text. Crucially, we identified a highly complementary error profile: the LLM excelled at syntactic consistency (e.g., date and numeric formatting), where humans faltered, while human experts provided superior semantic and contextual interpretation. We also found that the LLM's semantic errors reflected fundamental limitations that could not be mitigated by simple prompt engineering. These findings strongly support a human-in-the-loop (HITL) workflow in which the on-premises LLM serves as a collaborative tool rather than a full replacement. In this workflow, the LLM automates routine structuring and flags potential human errors, enabling imaging specialists to focus on high-level semantic validation. We discuss implications for structured reporting and interactive AI systems in clinical practice.
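As an illustration of the comparison and HITL flagging described above, the following is a minimal sketch, not the study's implementation. The abstract does not state how accuracy was computed, so exact match after whitespace and case normalization is an assumption here, and the field names and values (`scan_date`, `uterus_position`, `endometrioma_mm`) are hypothetical placeholders rather than data from the study.

```python
from dataclasses import dataclass


@dataclass
class FieldReview:
    """One structured field, with both extractions and a review flag."""
    field: str
    llm_value: str
    human_value: str
    needs_review: bool


def normalize(value) -> str:
    # Assumption: values are compared case-insensitively with collapsed whitespace.
    return " ".join(str(value).strip().lower().split())


def field_accuracy(extracted: dict, reference: dict) -> float:
    """Fraction of reference fields whose extracted value matches after normalization."""
    if not reference:
        raise ValueError("reference must contain at least one field")
    matches = sum(
        1
        for name, expected in reference.items()
        if normalize(extracted.get(name, "")) == normalize(expected)
    )
    return matches / len(reference)


def flag_disagreements(llm: dict, human: dict) -> list:
    """Pair up LLM and human values; flag every field where they differ."""
    reviews = []
    for name in sorted(set(llm) | set(human)):
        llm_value = llm.get(name, "")
        human_value = human.get(name, "")
        differs = normalize(llm_value) != normalize(human_value)
        reviews.append(FieldReview(name, str(llm_value), str(human_value), differs))
    return reviews


# Hypothetical example values, for illustration only.
human_extraction = {
    "scan_date": "2023-01-05",
    "uterus_position": "anteverted",
    "endometrioma_mm": "23",
}
llm_extraction = {
    "scan_date": "2023-01-05",
    "uterus_position": "Anteverted ",
    "endometrioma_mm": "32",
}

accuracy = field_accuracy(llm_extraction, human_extraction)
to_review = [r.field for r in flag_disagreements(llm_extraction, human_extraction) if r.needs_review]
```

In this sketch a disagreement does not say which side is wrong; it only routes the field to a specialist, which matches the complementary error profile reported above, where either the LLM or the human may be the source of the error.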