Semantic similarity measures (SSMs) are widely used in biomedical research but remain underutilized in pharmacovigilance. This study evaluates six ontology-based SSMs for clustering MedDRA Preferred Terms (PTs) in drug safety data. Using the Unified Medical Language System (UMLS), we assess each method's ability to group PTs around medically meaningful centroids. A high-throughput framework was developed, with a Java API and Python and R interfaces, to support large-scale similarity computations. Results show that path-based methods perform moderately, with F1 scores of 0.36 for WUPALMER and 0.28 for LCH, whereas intrinsic information content (IC)-based measures, especially INTRINSIC-LIN and SOKAL, consistently yield better clustering accuracy (F1 score of 0.403). Validated against expert review and Standardised MedDRA Queries (SMQs), our findings highlight the promise of IC-based SSMs for enhancing pharmacovigilance workflows by improving early signal detection and reducing manual review.
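To make the distinction between path-based and intrinsic IC-based measures concrete, the following minimal Python sketch computes Wu-Palmer, Leacock-Chodorow (LCH), and an intrinsic Lin similarity on a tiny hand-made taxonomy. It is not the study's Java API. The concept names, the `PARENT` table, and every function name are hypothetical, and the intrinsic IC uses the Seco et al. formulation as an assumption, since the abstract does not say which intrinsic IC variant was evaluated.

```python
import math

# Hypothetical toy taxonomy (child -> parent); NOT real MedDRA or UMLS content.
PARENT = {
    "cardiac_disorder": "root",
    "hepatic_disorder": "root",
    "arrhythmia": "cardiac_disorder",
    "myocardial_infarction": "cardiac_disorder",
    "tachycardia": "arrhythmia",
    "bradycardia": "arrhythmia",
    "hepatitis": "hepatic_disorder",
}
CONCEPTS = {"root"} | set(PARENT)


def ancestors(concept):
    """Return the path from a concept up to the root, concept included."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(concept):
    """Depth counted in nodes, so the root has depth 1."""
    return len(ancestors(concept))


def lcs(a, b):
    """Least common subsumer: the deepest ancestor shared by a and b."""
    shared = set(ancestors(b))
    return next(c for c in ancestors(a) if c in shared)


MAX_DEPTH = max(depth(c) for c in CONCEPTS)


def wu_palmer(a, b):
    """Path-based: 2 * depth(lcs) / (depth(a) + depth(b))."""
    return 2 * depth(lcs(a, b)) / (depth(a) + depth(b))


def lch(a, b):
    """Path-based: -log(path length in nodes / (2 * taxonomy depth))."""
    path_len = depth(a) + depth(b) - 2 * depth(lcs(a, b)) + 1
    return -math.log(path_len / (2 * MAX_DEPTH))


def intrinsic_ic(concept):
    """Assumed Seco-style intrinsic IC: 1 - log(descendants + 1) / log(N)."""
    n_desc = sum(1 for c in CONCEPTS if c != concept and concept in ancestors(c))
    return 1 - math.log(n_desc + 1) / math.log(len(CONCEPTS))


def intrinsic_lin(a, b):
    """IC-based: 2 * IC(lcs) / (IC(a) + IC(b)), defined as 0 when both ICs are 0."""
    denom = intrinsic_ic(a) + intrinsic_ic(b)
    return 2 * intrinsic_ic(lcs(a, b)) / denom if denom > 0 else 0.0


if __name__ == "__main__":
    pairs = [("tachycardia", "bradycardia"), ("tachycardia", "hepatitis")]
    for a, b in pairs:
        print(a, b, wu_palmer(a, b), lch(a, b), intrinsic_lin(a, b))
```

In this toy tree the sibling pair (tachycardia, bradycardia) scores higher than the cross-branch pair (tachycardia, hepatitis) under all three measures. The intrinsic measure derives its score from how many descendants the shared ancestor has rather than from edge counts alone, which is the property that separates the IC-based family from the path-based one.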