Motivation: Standard genome-wide association studies in cancer genomics rely on statistical significance with multiple testing correction, but they systematically fail in underpowered cohorts. In TCGA breast cancer (n=967, 133 deaths), the low event rate (13.8%) severely limits power, producing false negatives for known drivers and false positives for large passenger genes.

Results: We developed a five-criteria computational framework that integrates causal inference (inverse probability weighting, doubly robust estimation) with orthogonal biological validation (expression, mutation patterns, literature evidence). In the TCGA-BRCA mortality analysis, standard Cox+FDR detected zero genes at FDR<0.05, confirming complete failure in this underpowered setting. Our framework correctly identified RYR2 (a cardiac gene with no cancer function) as a false positive despite its nominal significance (p=0.024), and identified KMT2C as a complex candidate requiring validation despite only marginal significance (p=0.047, q=0.954). Power analysis revealed a median power of 15.1% across genes; KMT2C achieved only 29.8% power (HR=1.55), which explains its borderline statistical significance despite strong biological evidence. The framework distinguished true signals from artifacts through mutation pattern analysis: RYR2 showed 29.8% silent mutations (a passenger signature) and no hotspots, whereas KMT2C showed 6.7% silent mutations and 31.4% truncating variants (a driver signature). This multi-evidence approach provides a template for analyzing underpowered cohorts, prioritizing biological interpretability over purely statistical significance.

Availability: All code and analysis pipelines are available at github.com/akarlaraytu/causal-inference-for-cancer-genomics
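To make the power limitation concrete, the sketch below approximates per-gene power with Schoenfeld's formula for a binary covariate in a Cox model. This is an illustrative assumption, not necessarily the power calculation used in the study. The event count (133) and hazard ratio (1.55) come from the abstract; the function name `schoenfeld_power` and the mutated fraction of 0.09 are hypothetical placeholders, since the abstract does not report mutation frequencies.

```python
from math import log, sqrt
from statistics import NormalDist


def schoenfeld_power(n_events, hazard_ratio, mutated_fraction, alpha=0.05):
    """Approximate two-sided power for a binary covariate in a Cox model.

    Uses Schoenfeld's approximation:
        power ~= Phi( sqrt(D * p * (1 - p)) * |log HR| - z_{1 - alpha/2} )
    where D is the number of events and p is the fraction of patients
    carrying the mutation. The negligible opposite-tail term is ignored.
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    effect = sqrt(n_events * mutated_fraction * (1 - mutated_fraction))
    effect *= abs(log(hazard_ratio))
    return norm.cdf(effect - z_alpha)


# 133 deaths and HR = 1.55 are taken from the abstract.
# mutated_fraction = 0.09 is an ASSUMED placeholder, not a reported value.
power = schoenfeld_power(n_events=133, hazard_ratio=1.55, mutated_fraction=0.09)
print(f"approximate power: {power:.3f}")
```

With these placeholder inputs the approximation gives a power of roughly 0.30, which illustrates how 133 events leave even a moderately large hazard ratio well short of the conventional 80% target. The result depends strongly on the assumed mutated fraction, so it should not be read as a reproduction of the reported per-gene values.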