A Scalable Predictive Modelling Approach to Identifying Duplicate Adverse Event Reports for Drugs and Vaccines

Objectives: To advance the state of the art in duplicate detection for large-scale pharmacovigilance databases and to achieve more consistent performance across adverse event reports from different countries.

Background: Unlinked adverse event reports that refer to the same case impede statistical analysis and may mislead clinical assessment. Pharmacovigilance relies on large databases of adverse event reports to discover potential new causal associations, so computational methods are required to identify duplicates at scale. The current state of the art is statistical record linkage, which outperforms rule-based approaches. In particular, vigiMatch is in routine use for VigiBase, the WHO global database of adverse event reports, and is the first statistical duplicate detection approach in pharmacovigilance to be deployed at scale. Although originally developed for both medicines and vaccines, its application to vaccines has been limited by inconsistent performance across countries.

Methods: This paper extends vigiMatch from probabilistic record linkage to predictive modelling by refining the features for medicines, vaccines, and adverse events using country-specific reporting rates, extracting dates from free text, and training separate support vector machine classifiers for medicines and vaccines. Recall was evaluated on five independent labelled test sets. Precision was assessed by annotating random samples of report pairs classified as duplicates.

Results: Precision for the new method was 92% for vaccines and 54% for medicines, compared with 41% for the comparator method. Recall ranged from 80% to 85% across test sets for vaccines and from 40% to 86% for medicines, compared with 24% to 53% for the comparator method.

Conclusion: Predictive modelling, use of free text, and country-specific features advance the state of the art in duplicate detection for pharmacovigilance.
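
To make two of the steps in the Methods concrete, the following is a minimal Python sketch, not the vigiMatch implementation. It illustrates (a) extracting dates from free text and turning them into a pairwise feature, and (b) computing recall against a labelled test set and precision from annotated samples. Every name here (`ISO_DATE`, `extract_dates`, `shared_date_feature`, `recall`, `precision`) is hypothetical, and the sketch assumes only ISO-style `YYYY-MM-DD` dates, whereas real report narratives use many formats and languages.

```python
import re
from datetime import date

# Assumption: only ISO-style dates (YYYY-MM-DD) are recognised in this sketch.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def extract_dates(narrative: str) -> set:
    """Return the set of valid calendar dates mentioned in a narrative."""
    found = set()
    for year, month, day in ISO_DATE.findall(narrative):
        try:
            found.add(date(int(year), int(month), int(day)))
        except ValueError:
            # Matches the pattern but is not a real date (e.g. month 13).
            continue
    return found


def shared_date_feature(narrative_a: str, narrative_b: str) -> int:
    """Hypothetical pair feature: 1 if both narratives mention a common date."""
    return int(bool(extract_dates(narrative_a) & extract_dates(narrative_b)))


def recall(predicted_pairs: set, known_duplicates: set) -> float:
    """Share of known duplicate pairs that the classifier flagged."""
    if not known_duplicates:
        return 0.0
    return len(predicted_pairs & known_duplicates) / len(known_duplicates)


def precision(annotations: list) -> float:
    """Share of sampled flagged pairs that annotators confirmed as duplicates."""
    if not annotations:
        return 0.0
    return sum(1 for confirmed in annotations if confirmed) / len(annotations)
```

Report pairs are represented as `frozenset({id_a, id_b})` so that the order of the two reports does not matter. For example, `recall({frozenset({1, 2})}, {frozenset({1, 2}), frozenset({3, 4})})` compares one flagged pair against two known duplicates. In a setup like the one the abstract describes, a feature such as `shared_date_feature` would be one input among many to a classifier, alongside the structured fields.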
