Automatic identification of diagnosis from hospital discharge letters via weakly supervised Natural Language Processing

Identifying patient diagnoses from hospital discharge letters is essential for large-scale cohort selection and epidemiological research, but traditional supervised approaches require extensive manual annotation, which is often impractical for large textual datasets. We present a weakly supervised Natural Language Processing (NLP) pipeline for classifying Italian discharge letters without document-level manual annotation. The method extracts diagnosis-related sentences, generates semantic embeddings using a transformer model further pre-trained on Italian medical documents, and applies a two-level clustering procedure to derive weak labels that are then used to train a document-level classifier. The approach was evaluated in a case study on bronchiolitis using 33,176 discharge letters of children admitted to 44 emergency rooms or hospitals in the Veneto Region, Italy, between 2017 and 2020. The best weakly supervised model achieved an AUROC of 77.68% ($\pm4.30\%$), an AUPRC of 73.13% ($\pm4.93\%$), and an F1-score of 78.14% ($\pm4.89\%$) against manually annotated data. Performance surpassed unsupervised baselines and approached fully supervised models, while reducing the need for manual annotation by more than 1,500 hours for a dataset of this size. Similar model rankings were observed in a secondary validation on a smaller bronchitis dataset (3,188 discharge letters, 2020-2025), where the best weakly supervised model achieved an AUPRC of 76.72% ($\pm 5.02\%$). These results suggest the potential of weakly supervised NLP methods for scalable disease identification from clinical discharge letters.
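As a rough illustration of the weak-labelling idea, the following minimal sketch clusters sentence embeddings at two levels and propagates the result to documents. It is not the authors' implementation. Several elements are assumptions made only for this example: the hand-written 2-D vectors stand in for transformer embeddings of diagnosis-related sentences, plain k-means stands in for the unspecified clustering algorithm, the second level is taken to be a clustering of the first-level centroids, the target group is chosen by proximity to a hypothetical seed vector `seed_vec`, and a document is weakly labelled positive if any of its sentences falls in the target group.

```python
import math

def kmeans(points, k, iters=20):
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def weak_labels(docs, seed_vec, k_fine=4, k_coarse=2):
    """Return one weak 0/1 label per document (assumed labelling rule)."""
    sentences = [vec for doc in docs for vec in doc]
    # Level 1: fine-grained clusters of sentence embeddings.
    fine_labels, fine_centroids = kmeans(sentences, k_fine)
    # Level 2 (assumption): group the fine centroids into coarse clusters.
    coarse_labels, coarse_centroids = kmeans(fine_centroids, k_coarse)
    # Assumption: the coarse group closest to the seed vector is the target.
    target = min(range(k_coarse),
                 key=lambda g: math.dist(seed_vec, coarse_centroids[g]))
    labels, start = [], 0
    for doc in docs:
        doc_fine = fine_labels[start:start + len(doc)]
        start += len(doc)
        # Positive if any sentence belongs to a fine cluster in the target group.
        labels.append(int(any(coarse_labels[f] == target for f in doc_fine)))
    return labels

# Toy stand-ins for sentence embeddings, grouped by discharge letter.
docs = [
    [[0.9, 0.1], [1.0, 0.2]],  # sentences near the seed
    [[0.1, 0.9], [0.0, 1.0]],  # sentences far from the seed
    [[0.8, 0.3], [0.2, 0.8]],  # mixed letter
    [[0.1, 0.8], [0.2, 1.0]],  # sentences far from the seed
]
seed_vec = [1.0, 0.0]  # hypothetical embedding of a seed diagnosis term

print(weak_labels(docs, seed_vec))  # → [1, 0, 1, 0]
```

In a full pipeline, weak labels of this kind would serve as training targets for a document-level classifier, which is then evaluated against a manually annotated subset.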
