Breast cancer's complexity and variability make it difficult to understand disease progression and to guide effective treatment. This study integrates protein sequence data with protein expression levels to improve the molecular characterization of breast cancer subtypes and to predict clinical outcomes. Using ProtGPT2, a language model designed for protein sequences, we generated embeddings that capture functional and structural properties of proteins. These embeddings were combined with protein expression levels to form enriched biological representations, which were analyzed with machine learning methods, including ensemble K-means for clustering and XGBoost for classification. The approach clustered patients into biologically distinct groups and predicted clinical outcomes, achieving F1 scores of 0.88 for survival and 0.87 for biomarker status. Feature importance analysis identified KMT2C, CLASP2, and MYO1B as key proteins involved in hormone signaling, cytoskeletal remodeling, and therapy resistance in hormone receptor-positive and triple-negative breast cancer. Furthermore, protein-protein interaction networks and correlation analyses revealed functional interdependencies among proteins that may influence the behavior and progression of breast cancer subtypes. These findings suggest that integrating protein sequence and expression data provides valuable insights into tumor biology and could enhance personalized treatment strategies in breast cancer care.
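The abstract does not state how the sequence embeddings and expression levels were combined. As a minimal sketch of one possible scheme, the following assumes each patient is represented by an expression-weighted average of per-protein embedding vectors. The function name `fuse_patient_features`, the weighting rule, and the two-dimensional toy vectors are all hypothetical illustrations, not the study's method; real ProtGPT2 embeddings would have far more dimensions.

```python
from typing import Dict, List


def fuse_patient_features(
    expression: Dict[str, float],
    embeddings: Dict[str, List[float]],
) -> List[float]:
    """Expression-weighted average of per-protein embeddings (assumed scheme).

    expression: protein name -> expression level for one patient.
    embeddings: protein name -> fixed-length embedding vector.
    """
    # Only proteins present in both inputs can contribute.
    shared = sorted(p for p in expression if p in embeddings)
    if not shared:
        raise ValueError("no proteins shared between expression and embeddings")

    dim = len(embeddings[shared[0]])
    # Normalize by total absolute expression so the weights stay bounded
    # even if expression values are signed (e.g. z-scores).
    total_weight = sum(abs(expression[p]) for p in shared)
    if total_weight == 0:
        return [0.0] * dim

    fused = [0.0] * dim
    for protein in shared:
        weight = expression[protein] / total_weight
        for i, value in enumerate(embeddings[protein]):
            fused[i] += weight * value
    return fused


# Toy inputs: hypothetical values chosen only to make the arithmetic visible.
toy_embeddings = {"KMT2C": [1.0, 0.0], "CLASP2": [0.0, 1.0]}
toy_expression = {"KMT2C": 1.0, "CLASP2": 3.0}
patient_vector = fuse_patient_features(toy_expression, toy_embeddings)
```

Under this assumption, the resulting per-patient vectors could then be passed to a clustering or classification step. Concatenating summary statistics of the embeddings with the raw expression values would be an equally plausible alternative.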