From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity

Psychiatric comorbidity is clinically significant yet difficult to diagnose because multiple disorders co-occur and interact. To address this, we develop an approach that combines synthetic patient electronic medical record (EMR) construction with multi-agent diagnostic dialogue generation. We create 502 synthetic EMRs for common comorbid conditions using a pipeline designed to ensure clinical relevance and diversity. Our multi-agent framework translates the clinical interview protocol into a hierarchical state machine and context tree, supporting over 130 diagnostic states while maintaining clinical standards. Through this process, we construct PsyCoTalk, the first large-scale dialogue dataset supporting comorbidity, containing 3,000 multi-turn diagnostic dialogues validated by psychiatrists. Compared to real-world clinical transcripts, PsyCoTalk exhibits high structural and linguistic fidelity in terms of dialogue length, token distribution, and diagnostic reasoning strategies. Licensed psychiatrists confirm the realism and diagnostic validity of the dialogues. The dataset is a resource for psychiatric comorbidity research aimed at improving diagnostic accuracy and treatment planning, and it enables the development and evaluation of models capable of multi-disorder psychiatric screening in a single conversational pass.

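The abstract describes the interview protocol as a hierarchical state machine paired with a context tree, but gives no implementation details. The following is only a minimal sketch of that general idea, not the authors' framework. All names here (`State`, `ContextNode`, `run_interview`, `positive_states`, and the `disorder_A_*` / `disorder_B_*` labels) are hypothetical placeholders. It assumes that a parent state gates its children, so follow-up states are visited only after a positive answer, and that the context tree records every state visited together with its answer.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class State:
    """One node of the (hypothetical) hierarchical interview protocol."""
    name: str
    children: list["State"] = field(default_factory=list)


@dataclass
class ContextNode:
    """Record of one visited state and the answer obtained there."""
    state: str
    answer: bool
    children: list["ContextNode"] = field(default_factory=list)


def run_interview(state: State, ask: Callable[[str], bool]) -> ContextNode:
    """Visit a state; descend into its sub-states only on a positive answer."""
    answer = ask(state.name)
    node = ContextNode(state.name, answer)
    if answer:
        for child in state.children:
            node.children.append(run_interview(child, ask))
    return node


def positive_states(node: ContextNode) -> list[str]:
    """Collect, in visit order, every state that was answered positively."""
    found = [node.state] if node.answer else []
    for child in node.children:
        found.extend(positive_states(child))
    return found


# Placeholder protocol: two screening modules, each with follow-up states.
protocol = State("interview", [
    State("disorder_A_screen", [
        State("disorder_A_symptom_1"),
        State("disorder_A_symptom_2"),
    ]),
    State("disorder_B_screen", [
        State("disorder_B_symptom_1"),
    ]),
])

# Simulated patient: answers "yes" only for the states listed here.
yes_answers = {"interview", "disorder_A_screen", "disorder_A_symptom_2"}
tree = run_interview(protocol, lambda name: name in yes_answers)
print(positive_states(tree))
```

In this sketch the second module's follow-up state is never asked because its screening state is answered negatively, which is the gating behavior a hierarchy is meant to provide. Covering several modules in one traversal is one way a single conversational pass could screen for more than one disorder.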
