Classifying genetic variants as pathogenic or benign remains a pivotal challenge in clinical genetics. Recently, protein language models have improved the accuracy of generic variant effect prediction (VEP) through weakly supervised or unsupervised training. However, these predictors are not disease-specific, which limits their adoption at the point of care. To address this problem, we propose ProPath, a disease-specific \textsc{pro}tein language model for variant \textsc{path}ogenicity that uses a siamese network to capture the pseudo-log-likelihood ratio of rare missense variants. We evaluate ProPath against pre-trained language models on clinical variant sets for inherited cardiomyopathies and arrhythmias that were not seen during training. ProPath surpasses the pre-trained ESM1b, improving AUC by more than $5\%$ on both datasets, and achieves the highest performance among all compared baselines on both datasets. ProPath thus offers disease-specific variant effect prediction that is particularly valuable for studying disease associations and for clinical use.
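
For readers unfamiliar with the term, one common way to write a pseudo-log-likelihood ratio for a missense variant is sketched below. This is an illustrative definition only: the symbols $x^{\mathrm{wt}}$, $x^{\mathrm{mut}}$, $L$, and $p_\theta$ are assumptions introduced here, and the abstract does not state the exact form or sign convention that ProPath uses. Let $x^{\mathrm{wt}}$ and $x^{\mathrm{mut}}$ be the wild-type and variant protein sequences of length $L$, and let $p_\theta(x_i \mid x_{\setminus i})$ be the probability a masked language model assigns to residue $x_i$ given all other residues. Then
\[
\mathrm{PLLR}\left(x^{\mathrm{mut}}, x^{\mathrm{wt}}\right)
= \sum_{i=1}^{L} \log p_\theta\left(x^{\mathrm{mut}}_i \mid x^{\mathrm{mut}}_{\setminus i}\right)
- \sum_{i=1}^{L} \log p_\theta\left(x^{\mathrm{wt}}_i \mid x^{\mathrm{wt}}_{\setminus i}\right).
\]
Under this convention, a strongly negative value means the model finds the variant sequence much less plausible than the wild type. A siamese setup would, under the same assumptions, pass both sequences through one shared-weight encoder so that the two sums are computed with identical parameters $\theta$.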