Reproducible Synthetic Clinical Letters for Seizure Frequency Information Extraction

Seizure-frequency information is important for epilepsy research and clinical care, but it is usually recorded in variable free-text clinic letters that are hard to annotate and share. We developed a reproducible, privacy-preserving framework for extracting seizure frequency using fully synthetic yet task-faithful epilepsy letters. We defined a structured label scheme covering common descriptions of seizure burden, including explicit rates, ranges, clusters, seizure-free intervals, unknown frequency, and explicit no-seizure statements. A teacher language model generated NHS-style synthetic letters paired with normalized labels, rationales, and evidence spans. We fine-tuned several open-weight language models (4B-14B parameters) on these synthetic letters to extract seizure frequency from full documents, comparing direct numeric prediction with structured label prediction and testing evidence-grounded outputs. On a clinician-checked held-out set of real clinic letters, models trained only on synthetic data generalized well, and structured labels consistently outperformed direct numeric regression. With 15,000 synthetic training letters, models achieved micro-F1 scores up to 0.788 for fine-grained categories and 0.847 for pragmatic categories; a medically oriented 4B model achieved 0.787 and 0.858, respectively. Evidence-grounded outputs also supported rapid clinical verification and error analysis. These results show that synthetic, structured, evidence-grounded supervision can enable robust seizure-frequency extraction without sharing sensitive patient text and may generalize to other temporally complex clinical information extraction tasks.
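
To make the label scheme and the reported metric concrete, the following is a minimal sketch and not the authors' implementation. The class name `FrequencyLabel`, its field names, the kind strings, and the 30-day month used for normalization are all assumptions introduced for illustration. Only the six categories and the use of micro-F1 come from the abstract.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# Hypothetical kind names, modeled on the categories listed in the abstract.
KINDS = {"rate", "range", "cluster", "seizure_free", "unknown", "no_seizures"}


@dataclass(frozen=True)
class FrequencyLabel:
    """Illustrative structured label; field names are assumptions."""
    kind: str
    low: Optional[float] = None          # seizure count (lower bound) per period
    high: Optional[float] = None         # upper bound, used for ranges
    period_days: Optional[float] = None  # length of the counting period in days
    evidence: str = ""                   # text span from the letter supporting the label

    def __post_init__(self) -> None:
        if self.kind not in KINDS:
            raise ValueError(f"unknown label kind: {self.kind!r}")


def per_month(label: FrequencyLabel) -> Optional[Tuple[float, float]]:
    """Normalize a rate or range to seizures per 30-day month (an assumed convention).

    Returns None for kinds that carry no numeric rate in this sketch.
    """
    if label.kind in ("rate", "range") and label.low is not None and label.period_days:
        high = label.high if label.high is not None else label.low
        scale = 30.0 / label.period_days
        return (label.low * scale, high * scale)
    return None


def micro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Micro-averaged F1 over categorical labels, pooling counts across classes."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if g == p:
            tp += 1
        else:
            fp += 1  # the predicted class gains a false positive
            fn += 1  # the gold class gains a false negative
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    label = FrequencyLabel("rate", low=2, period_days=15, evidence="two seizures a fortnight")
    print(per_month(label))
    print(micro_f1(["rate", "range", "unknown"], ["rate", "rate", "unknown"]))
```

When every letter receives exactly one predicted category, as in this sketch, micro-F1 coincides with plain accuracy. The two diverge only if a model may abstain or emit several labels per letter.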
