Automated International Classification of Diseases (ICD) coding assigns multiple disease codes to clinical documents and plays a critical role in healthcare informatics. However, its performance is hindered by the extreme long-tail distribution of the ICD ontology: a few common codes dominate, while thousands of rare codes have very few training examples. To address this issue, we propose a Probability-Biased Directed Graph Attention model (ProBias), which partitions codes into common and rare sets and allows information to flow only from common to rare codes. Edge weights are determined by conditional co-occurrence probabilities, which guide the attention mechanism to enrich rare-code representations with clinically related signals. To supply higher-quality semantic representations as model inputs, we further employ large language models to generate enriched textual descriptions of ICD codes, providing external clinical context that complements the statistical co-occurrence signals. Applied to automated ICD coding, our approach substantially improves the representation and prediction of rare codes, achieving state-of-the-art performance on three benchmark datasets. In particular, we observe large gains in macro-averaged F1 score, a key metric for long-tail classification.
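The core mechanism described above can be sketched as follows. This is a minimal, illustrative reconstruction, not the paper's exact formulation: the toy label matrix, the hand-picked common/rare split, the embedding dimensions, and the use of a log-probability additive bias on the attention logits are all assumptions made for clarity.

```python
import numpy as np

# Toy multi-label matrix: rows = clinical documents, columns = ICD codes
# (1 = code assigned). Codes 0-1 play the role of "common" codes,
# codes 2-3 the role of "rare" codes (an illustrative split, not the
# paper's frequency threshold).
Y = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
])
common, rare = [0, 1], [2, 3]

# Conditional co-occurrence P(rare_j | common_i): these probabilities
# weight the directed common -> rare edges.
counts_common = Y[:, common].sum(axis=0)        # occurrences of each common code
co = Y[:, common].T @ Y[:, rare]                # joint counts, shape (|common|, |rare|)
P = co / counts_common[:, None]                 # P(rare_j | common_i)

# Probability-biased directed attention: bias each rare code's attention
# logits toward common codes it co-occurs with, then aggregate the common
# codes' embeddings into the rare code's representation. Information flows
# only common -> rare; common-code embeddings are left untouched.
rng = np.random.default_rng(0)
H_common = rng.normal(size=(len(common), 8))    # common-code embeddings (stand-ins)
H_rare = rng.normal(size=(len(rare), 8))        # rare-code embeddings (stand-ins)

logits = H_rare @ H_common.T + np.log(P.T + 1e-9)   # (|rare|, |common|)
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
H_rare_enriched = H_rare + attn @ H_common      # rare codes absorb common-code signal
```

In a full model the embeddings would come from LLM-generated code descriptions rather than random vectors, and the attention would be learned; the sketch only shows how the co-occurrence statistics enter the attention computation.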