Understanding Breast Cancer Survival: Using Causality and Language Models on Multi-omics Data

The need for more usable and explainable machine learning models in healthcare increases the importance of developing and applying causal discovery algorithms, which aim to recover causal relations from observational data. Explainable approaches help clinicians and biologists predict disease prognosis and suggest appropriate treatments. However, very little research has been conducted at the intersection of causal discovery, genomics, and breast cancer, and we aim to bridge this gap. Moreover, evaluating causal discovery methods on real data is notoriously difficult because ground-truth causal relations are usually unknown; in this paper, we therefore also propose to address the evaluation problem with large language models. In particular, we apply suitable causal discovery algorithms to investigate how various perturbations in the genome affect the survival of patients diagnosed with breast cancer. We use three causal discovery algorithms: PC, Greedy Equivalence Search (GES), and a Generalized Precision Matrix-based algorithm. We experiment with a subset of The Cancer Genome Atlas that contains information about mutations, copy number variations, protein levels, and gene expression for 705 breast cancer patients. Using these algorithms, we identify important factors related to the vital status of patients. However, the reliability of such results remains a concern in the medical domain. Accordingly, as another contribution of this work, we validate the results with language models trained on biomedical literature, such as BlueBERT, and with other large language models trained on medical corpora. Our results support the use of causal discovery algorithms together with language models to reveal reliable causal relations for clinical applications.
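For readers unfamiliar with constraint-based causal discovery, the following is a minimal sketch of the kind of conditional independence test that the PC algorithm relies on when deciding whether to remove an edge. It is an illustration under stated assumptions, not the pipeline used in this work: the abstract does not say which independence test was applied, so a Fisher z-test on partial correlations (a common choice for continuous, roughly Gaussian data) is assumed here. The function name `fisher_z_pvalue` and the synthetic chain X -> Y -> Z are hypothetical and are not drawn from The Cancer Genome Atlas.

```python
import math
import numpy as np


def fisher_z_pvalue(data, i, j, cond=()):
    """Two-sided p-value for H0: column i is independent of column j
    given the columns in `cond` (assumes roughly Gaussian data)."""
    idx = [i, j, *cond]
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.inv(corr)
    # Partial correlation of i and j given cond, read off the precision matrix.
    r = -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])
    r = min(max(r, -0.999999), 0.999999)  # keep the log below finite
    n = data.shape[0]
    z = math.sqrt(n - len(cond) - 3) * 0.5 * math.log((1 + r) / (1 - r))
    return math.erfc(abs(z) / math.sqrt(2))


# Synthetic chain X -> Y -> Z (illustrative data only).
rng = np.random.default_rng(0)
n_samples = 2000
x = rng.normal(size=n_samples)
y = x + rng.normal(size=n_samples)
z = y + rng.normal(size=n_samples)
data = np.column_stack([x, y, z])

p_marginal = fisher_z_pvalue(data, 0, 2)           # X vs Z, no conditioning
p_conditional = fisher_z_pvalue(data, 0, 2, (1,))  # X vs Z given Y
print(f"p(X indep Z)     = {p_marginal:.3g}")
print(f"p(X indep Z | Y) = {p_conditional:.3g}")
```

In this sketch, X and Z are strongly dependent marginally but become independent once Y is conditioned on, which is the pattern a PC-style procedure would use to drop the direct X-Z edge. Score-based methods such as GES work differently, searching over graph structures to optimize a score rather than running independence tests.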
