Clinical narratives encode temporal dynamics essential for modeling patient trajectories, yet large-scale temporally annotated resources are scarce. We introduce PMOA-TTS, a corpus of 124,699 single-patient PubMed Open Access case reports converted into structured textual timelines of (event, time) pairs using a scalable large-language-model pipeline (Llama 3.3 70B and DeepSeek-R1). The corpus comprises over 5.6 million timestamped events, alongside extracted demographics and diagnoses. Technical validation uses a clinician-curated gold set and three measures: semantic event matching, temporal concordance (c-index), and alignment error summarized with Area Under the Log-Time CDF (AULTC). We benchmark alternative prompting and model choices and provide documentation to support reproduction. PMOA-TTS enables research on timeline extraction, temporal reasoning, survival modeling and event forecasting from narrative text, and offers broad diagnostic and demographic coverage. Data and code are openly available in public repositories.
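To make the temporal concordance measure concrete, the following is a minimal sketch of a pairwise concordance index over (event, time) pairs. It assumes the events have already been matched between the gold and extracted timelines, uses the standard Harrell-style definition (tied predictions count as half-concordant), and is not necessarily the exact procedure used in the PMOA-TTS validation. The function name, the example events, and the choice of hours as the time unit are all hypothetical.

```python
from itertools import combinations


def concordance_index(gold_times, pred_times):
    """Fraction of comparable event pairs whose predicted order matches the gold order.

    Assumes gold_times[i] and pred_times[i] refer to the same matched event.
    """
    concordant = 0.0
    comparable = 0
    for (g1, p1), (g2, p2) in combinations(zip(gold_times, pred_times), 2):
        if g1 == g2:
            continue  # gold tie: the pair carries no ordering information
        comparable += 1
        if p1 == p2:
            concordant += 0.5  # predicted tie counts as half-concordant
        elif (g1 < g2) == (p1 < p2):
            concordant += 1.0
    return concordant / comparable if comparable else float("nan")


# Hypothetical timelines: (event, hours relative to admission).
gold = [("fever", -48), ("admission", 0), ("CT scan", 6), ("surgery", 24)]
pred = [("fever", -72), ("admission", 0), ("CT scan", 30), ("surgery", 24)]

c = concordance_index([t for _, t in gold], [t for _, t in pred])
print(f"c-index = {c:.3f}")
```

In this toy case only the (CT scan, surgery) pair is ordered incorrectly, so five of the six comparable pairs are concordant.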