Clinical case reports and discharge summaries may be the most complete and accurate summaries of patient encounters, yet they are finalized, and therefore timestamped, only after the encounter ends. Complementary structured data streams become available sooner but are often incomplete. To train models and algorithms on more complete and temporally fine-grained data, we construct a pipeline to phenotype, extract, and annotate time-localized findings within case reports using large language models (LLMs). We apply our pipeline to generate an open-access textual time series corpus for Sepsis-3 comprising 2,139 case reports from the PubMed Open Access (PMOA) Subset. To validate our system, we apply it to PMOA and to timeline annotations from i2b2/MIMIC-IV, and compare the results to physician-expert annotations. We show high recovery rates of clinical findings (event match rate: GPT-5, 0.93; Llama 3.3 70B Instruct, 0.76) and strong temporal ordering (concordance: GPT-5, 0.965; Llama 3.3 70B Instruct, 0.908). Our work characterizes the ability of LLMs to time-localize clinical findings in text, illustrates the limitations of LLM use for temporal reconstruction, and suggests several potential avenues for improvement via multimodal integration.
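The two validation metrics named above can be illustrated with a minimal sketch. The abstract does not give their exact definitions, so the following is an assumption: the event match rate is taken as the fraction of reference events recovered by exact match after simple normalization, and concordance as the fraction of event pairs whose predicted order agrees with the reference order. The function names, the toy events, and the use of hours relative to admission are all hypothetical.

```python
from itertools import combinations


def event_match_rate(reference_events, predicted_events):
    """Fraction of reference events found among the predicted events.

    Assumption: exact string match after lowercasing and trimming; a real
    system would likely use a softer, semantic matching criterion.
    """
    if not reference_events:
        return 0.0
    predicted = {e.strip().lower() for e in predicted_events}
    matched = sum(1 for e in reference_events if e.strip().lower() in predicted)
    return matched / len(reference_events)


def concordance(reference_times, predicted_times):
    """Pairwise ordering agreement over events present in both timelines.

    Pairs tied in the reference are skipped; pairs tied only in the
    prediction count as half-concordant (a common c-index convention).
    """
    shared = sorted(set(reference_times) & set(predicted_times))
    concordant = 0.0
    comparable = 0
    for a, b in combinations(shared, 2):
        ref_diff = reference_times[a] - reference_times[b]
        if ref_diff == 0:
            continue  # no defined order in the reference
        comparable += 1
        pred_diff = predicted_times[a] - predicted_times[b]
        if pred_diff == 0:
            concordant += 0.5
        elif (ref_diff > 0) == (pred_diff > 0):
            concordant += 1.0
    return concordant / comparable if comparable else float("nan")


# Hypothetical timelines: event -> hours relative to admission.
reference = {"fever": -48, "admission": 0, "hypotension": 6, "intubation": 24}
predicted = {"fever": -24, "admission": 0, "hypotension": 12, "discharge": 240}

print(event_match_rate(list(reference), list(predicted)))
print(concordance(reference, predicted))
```

In this toy case three of the four reference events are recovered (0.75), and all three pairs of shared events keep their reference order (1.0), even though the predicted timestamps themselves are off. This shows why ordering agreement and event recovery are reported as separate quantities.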