Detecting the Clinical Features of Difficult-to-Treat Depression using Synthetic Data from Large Language Models

Difficult-to-treat depression (DTD) has been proposed as a broader and more clinically comprehensive perspective on a person's depressive disorder, in which the person continues to experience significant burden despite treatment. We sought to develop a large language model (LLM)-based tool capable of interrogating routinely collected, narrative (free-text) electronic health record (EHR) data to locate published prognostic factors that capture the clinical syndrome of DTD. In this work, we use LLM-generated synthetic data (GPT-3.5) and a Non-Maximum Suppression (NMS) algorithm to train a BERT-based span extraction model. The resulting model is then able to extract and label spans related to a variety of relevant positive and negative factors in real clinical data (i.e., spans of text that increase or decrease the likelihood of a patient matching the DTD syndrome). We show that, by training the model exclusively on synthetic data, it is possible to obtain good overall performance (0.70 F1 across polarity) on real clinical data on a set of as many as 20 different factors, and high performance (0.85 F1 with 0.95 precision) on a subset of important DTD factors such as history of abuse, family history of affective disorder, illness severity and suicidality. Our results show promise for future healthcare applications, especially those in which highly confidential medical data and human-expert annotation would traditionally be required.
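
For readers unfamiliar with Non-Maximum Suppression applied to text, the following is a minimal, generic sketch of NMS over scored token spans. It is not the authors' implementation: the abstract does not say how NMS is configured or at which stage it is applied, so the `Span` type, the overlap measure (intersection over union), the 0.5 threshold, the label-agnostic suppression, and the example labels and scores are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int    # inclusive token index
    end: int      # exclusive token index
    score: float  # model confidence (hypothetical values below)
    label: str    # factor name (hypothetical labels below)


def span_iou(a: Span, b: Span) -> float:
    """Intersection over union of two 1-D token spans."""
    inter = max(0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(spans: List[Span], iou_threshold: float = 0.5) -> List[Span]:
    """Greedily keep the highest-scoring spans, dropping any candidate
    whose overlap with an already kept span exceeds the threshold.
    Suppression here ignores labels (an assumption of this sketch)."""
    kept: List[Span] = []
    for cand in sorted(spans, key=lambda s: s.score, reverse=True):
        if all(span_iou(cand, k) <= iou_threshold for k in kept):
            kept.append(cand)
    # Return the survivors in reading order.
    return sorted(kept, key=lambda s: s.start)


# Assumed candidate spans, as a span extraction model might propose them.
candidates = [
    Span(0, 4, 0.90, "suicidality"),
    Span(1, 5, 0.60, "suicidality"),       # heavily overlaps the first span
    Span(10, 14, 0.80, "family_history"),
]
selected = non_max_suppression(candidates)
for span in selected:
    print(span)
```

In this example the second candidate overlaps the first with an IoU of 0.6, which is above the assumed threshold, so only the higher-scoring span and the non-overlapping third span are kept.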
