Large language models (LLMs) offer substantial promise for automating clinical text summarization, yet maintaining factual consistency remains challenging due to the length, noise, and heterogeneity of clinical documentation. We present AgenticSum, an inference-time, agentic framework that separates context selection, generation, verification, and targeted correction to reduce hallucinated content. The framework decomposes summarization into coordinated stages that compress task-relevant context, generate an initial draft, identify weakly supported spans using internal attention-based grounding signals, and selectively revise flagged content under supervisory control. We evaluate AgenticSum on two public datasets using reference-based metrics, LLM-as-a-judge assessment, and human evaluation. Across these measures, AgenticSum demonstrates consistent improvements over vanilla LLMs and other strong baselines. Our results indicate that structured, agentic design with targeted correction offers an effective inference-time approach to improving clinical note summarization with LLMs.
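The staged pipeline described above can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: every function body is a naive stand-in (the real system uses LLM calls and internal attention grounding signals), and all names are hypothetical.

```python
# Illustrative sketch of an AgenticSum-style staged pipeline.
# All function bodies are naive stand-ins, not the paper's actual method.

def select_context(document: str, max_sents: int = 3) -> str:
    """Compress task-relevant context (stand-in: keep the first sentences)."""
    sents = [s.strip() for s in document.split(".") if s.strip()]
    return ". ".join(sents[:max_sents]) + "."

def generate_draft(context: str) -> str:
    """Generate an initial draft summary (stand-in for an LLM call)."""
    return context

def flag_weak_spans(draft: str, context: str) -> list[str]:
    """Identify weakly supported spans (stand-in for attention-based
    grounding: here, simply sentences not present in the context)."""
    return [s.strip() for s in draft.split(".")
            if s.strip() and s.strip() not in context]

def revise(draft: str, flagged: list[str]) -> str:
    """Selectively revise flagged content (stand-in: drop flagged spans)."""
    kept = [s.strip() for s in draft.split(".")
            if s.strip() and s.strip() not in flagged]
    return ". ".join(kept) + "." if kept else ""

def agentic_summarize(document: str) -> str:
    """Coordinate the stages: select, generate, verify, correct."""
    context = select_context(document)
    draft = generate_draft(context)
    flagged = flag_weak_spans(draft, context)
    return revise(draft, flagged) if flagged else draft
```

The point of the decomposition is that each stage is independently replaceable: the verification and correction steps run only at inference time, so the base generator needs no retraining.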