Improving ICD-based semantic similarity by accounting for varying degrees of comorbidity

Finding similar patients is a common objective in precision medicine, facilitating treatment outcome assessment and clinical decision support. Choosing widely available patient features and appropriate mathematical methods for similarity calculations is crucial. International Statistical Classification of Diseases and Related Health Problems (ICD) codes are used worldwide to encode diseases and are available for nearly all patients. Aggregated as sets of primary and secondary diagnoses, they can display a degree of comorbidity and reveal comorbidity patterns. Patient similarity can be computed from these ICD codes using semantic similarity algorithms. These algorithms have traditionally been evaluated on a single-term, expert-rated data set. However, real-world patient data often display varying degrees of documented comorbidity, which might impair algorithm performance. To account for this, we present a scale term that considers the variance in documented comorbidity. In this work, we compared the performance of 80 combinations of established algorithms for semantic similarity based on sets of ICD codes. The sets were extracted from patients with a C25.X (pancreatic cancer) primary diagnosis and provide a variety of different combinations of ICD codes. Using our scale term, we obtained the best results with a combination of level-based information content, Leacock & Chodorow concept similarity, and bipartite graph matching for the set similarities, reaching a correlation of 0.75 with our expert's ground truth. Our results highlight the importance of accounting for comorbidity variance while demonstrating how well current semantic similarity algorithms perform.
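
To make the pipeline named above more concrete, the following minimal sketch combines a Leacock & Chodorow concept similarity with a bipartite matching of two ICD code sets. It is an illustration under stated assumptions, not the implementation evaluated in this work: the hierarchy is a hand-written toy excerpt of ICD-10, the path length is counted in nodes, the concept similarity is rescaled to [0, 1], the matching is found by brute force, and the set score is divided by the size of the larger set. The function names are hypothetical. The level-based information content and the proposed scale term are deliberately left out, since the abstract does not define them.

```python
import math
from itertools import permutations

# Toy excerpt of the ICD-10 hierarchy (child -> parent). "ROOT" is an
# artificial top node added for this sketch; it is not an ICD code.
PARENT = {
    "C00-D48": "ROOT", "C15-C26": "C00-D48", "C25": "C15-C26",
    "C25.0": "C25", "C25.9": "C25",
    "E00-E90": "ROOT", "E10-E14": "E00-E90", "E11": "E10-E14", "E11.9": "E11",
    "K00-K93": "ROOT", "K80-K87": "K00-K93", "K86": "K80-K87", "K86.1": "K86",
}
# Assumed taxonomy depth in nodes: ROOT, chapter, block, category, subcategory.
MAX_DEPTH = 5


def path_to_root(code):
    """Return the list of nodes from `code` up to and including ROOT."""
    path = [code]
    while path[-1] != "ROOT":
        path.append(PARENT[path[-1]])
    return path


def path_length(a, b):
    """Number of nodes on the shortest path between a and b (1 if a == b)."""
    up_a, up_b = path_to_root(a), path_to_root(b)
    index_in_b = {node: i for i, node in enumerate(up_b)}
    for i, node in enumerate(up_a):
        if node in index_in_b:  # first shared ancestor
            return i + index_in_b[node] + 1
    raise ValueError("codes do not share a root")


def lc_similarity(a, b):
    """Leacock & Chodorow similarity, rescaled to [0, 1] (an assumption)."""
    raw = math.log(2 * MAX_DEPTH / path_length(a, b))  # equals -log(len / 2D)
    return raw / math.log(2 * MAX_DEPTH)


def set_similarity(codes_a, codes_b):
    """Best one-to-one matching of two code sets, found by brute force.

    The summed similarity of the best matching is divided by the size of
    the larger set (an assumed normalization), so unmatched codes lower
    the score.
    """
    small, large = sorted((list(codes_a), list(codes_b)), key=len)
    if not small:
        return 0.0
    best = max(
        sum(lc_similarity(s, l) for s, l in zip(small, perm))
        for perm in permutations(large, len(small))
    )
    return best / len(large)


patient_a = {"C25.0", "E11.9"}
patient_b = {"C25.9", "K86.1"}
print(set_similarity(patient_a, patient_b))
```

Under this particular normalization, `set_similarity({"C25.0"}, {"C25.0", "E11.9"})` evaluates to 0.5 even though the only code of the first patient is matched perfectly. This shows how a difference in the number of documented diagnoses alone can shift a set-level score. The brute-force search is only practical for small sets; a dedicated assignment solver would be the usual choice for larger ones.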
