The Critical Role of Model Selection in Causal Inference: A Comparative Analysis of Classification Models within the InferBERT Framework for Pharmacovigilance

Csaba Kiss, Roland Molontay, Gabriele Pergola

From arXiv, 10 pages, 5 figures

Distinguishing causal adverse drug events (ADEs) from spurious correlations remains a central challenge in pharmacovigilance. The InferBERT framework integrates transformer models with do-calculus, but its success hinges on the underlying classification model. This study evaluates the impact of model choice in InferBERT, asking whether simpler models suffice, whether domain-specific pre-training helps, whether scaling to large language models (LLMs) improves causal detection, and what effect post-hoc calibration has. We performed a comparative study on two benchmarks: Analgesics-induced Acute Liver Failure (AILF) and Tramadol-related Mortalities (TRAM). Four models were evaluated using 5-fold cross-validation repeated over 20 runs: XGBoost (baseline), ALBERT (original InferBERT), BioBERT (biomedical transformer), and Med-LLaMA (medical LLM). We measured accuracy, Expected Calibration Error (ECE) before and after isotonic regression, and the Jaccard concordance of causal terms with three disproportionality measures: the proportional reporting ratio (PRR), the reporting odds ratio (ROR), and the empirical Bayes geometric mean (EBGM). Significance was tested with paired t-tests. BioBERT achieved the highest accuracy on both datasets, while Med-LLaMA underperformed despite its size and parameter-efficient fine-tuning. Domain-specific pre-training was decisive. Calibration improved ECE but had mixed effects on accuracy and causal discovery. BioBERT also showed the strongest concordance with traditional pharmacovigilance signals. These results show that domain-specific pre-training provides a clear advantage over both simpler baselines and larger LLMs. Investing in manageable, domain-aware models is more effective for computational pharmacovigilance than simply scaling model size.

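The evaluation metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 10-bin confidence binning for ECE, the 2x2 contingency-table layout for PRR and ROR, and the convention that two empty term sets have a Jaccard score of 0 are all assumptions made here for illustration. EBGM and isotonic regression are omitted.

```python
from typing import Iterable, Sequence


def expected_calibration_error(probs: Sequence[float],
                               labels: Sequence[int],
                               n_bins: int = 10) -> float:
    """ECE for a binary classifier.

    probs holds predicted probabilities of the positive class. Each sample's
    confidence is max(p, 1 - p); samples are grouped into equal-width
    confidence bins and ECE is the bin-size-weighted mean of
    |accuracy - average confidence|.
    """
    if len(probs) != len(labels) or len(probs) == 0:
        raise ValueError("probs and labels must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = max(p, 1.0 - p)
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to last bin
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    total = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(hit for _, hit in b) / len(b)
        ece += (len(b) / total) * abs(acc - avg_conf)
    return ece


def jaccard(terms_a: Iterable[str], terms_b: Iterable[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| between two sets of signal terms."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    # Assumed convention: two empty sets share no signal, so return 0.0.
    return len(a & b) / len(union) if union else 0.0


def prr(a: int, b: int, c: int, d: int) -> float:
    """Proportional reporting ratio from a 2x2 table.

    a: reports with the drug and the event, b: drug without the event,
    c: other drugs with the event,         d: other drugs without the event.
    """
    return (a / (a + b)) / (c / (c + d))


def ror(a: int, b: int, c: int, d: int) -> float:
    """Reporting odds ratio (a * d) / (b * c) from the same 2x2 table."""
    return (a * d) / (b * c)
```

In a setup like the one described, `jaccard` would compare the set of terms a model flags as causal against the set flagged by a disproportionality measure such as PRR at some threshold. The threshold itself is not given in the abstract and would need to be taken from the paper.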