Incorporating external information into high-dimensional models of gene expression data has been shown, both theoretically and empirically, to substantially improve performance. Such external information, often called prior information or priors, is increasingly available from multiple sources, but its reliability can vary considerably. Existing approaches often integrate these priors without adequately accounting for their quality, which can yield unsatisfactory or even misleading results. To exploit such priors effectively and selectively, we propose the adaptive Multi-Prior Lasso, a novel regularization approach that simultaneously identifies reliable prior sources and integrates them to improve model performance. For high-dimensional generalized linear models (GLMs), the method assigns an adaptive, data-driven weight to each prior, so that more reliable sources are emphasized and less credible ones are downweighted. We establish theoretical guarantees and show through extensive simulations that the proposed method improves estimation, prediction, and variable selection. An application to TCGA breast cancer gene expression data further illustrates its practical value, showing that incorporating prior information from published studies in PubMed improves model performance.
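To make the idea of data-driven prior weighting concrete, the following is a minimal sketch and not the estimator proposed here. It assumes a Gaussian linear model (the simplest GLM), priors encoded as binary indicator vectors over features, and a hypothetical weighting rule in which each prior is scored by the validation error of a lasso that penalizes its flagged features less. The function names (`weighted_lasso`, `prior_weights`, `multi_prior_fit`, `run_demo`) and the parameters `floor` and `tau` are illustrative assumptions.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def weighted_lasso(X, y, lam, penalty_factor, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * sum_j w_j |b_j|."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()  # residual y - X @ beta, kept up to date
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the partial residual.
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * penalty_factor[j]) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta


def prior_weights(X_tr, y_tr, X_va, y_va, priors, lam, floor=0.1, tau=0.1):
    """Assumed rule: softmax of negative validation MSE, one score per prior."""
    errs = []
    for z in priors:
        # Features flagged by this prior get a reduced penalty factor.
        pf = np.where(z, floor, 1.0)
        b = weighted_lasso(X_tr, y_tr, lam, pf)
        errs.append(np.mean((y_va - X_va @ b) ** 2))
    errs = np.array(errs)
    scores = np.exp(-(errs - errs.min()) / tau)
    return scores / scores.sum()


def multi_prior_fit(X, y, priors, lam=0.3, floor=0.1, tau=0.1):
    """Weight the priors on a train/validation split, then refit on all data."""
    half = X.shape[0] // 2
    omega = prior_weights(X[:half], y[:half], X[half:], y[half:],
                          priors, lam, floor, tau)
    # Weighted vote per feature, in [0, 1]; more support means less penalty.
    support = omega @ priors.astype(float)
    pf = 1.0 - (1.0 - floor) * support
    beta = weighted_lasso(X, y, lam, pf)
    return omega, beta


def run_demo(seed=0):
    """Synthetic data: one prior matches the true support, one does not."""
    rng = np.random.default_rng(seed)
    n, p = 200, 50
    X = rng.standard_normal((n, p))
    beta_true = np.zeros(p)
    beta_true[:5] = 2.0
    y = X @ beta_true + rng.standard_normal(n)
    good = np.zeros(p, dtype=bool)
    good[:5] = True    # flags the truly active features
    bad = np.zeros(p, dtype=bool)
    bad[45:] = True    # flags pure-noise features
    return multi_prior_fit(X, y, np.vstack([good, bad]))
```

Calling `run_demo()` returns the per-prior weights and the fitted coefficients. In this toy setup the prior that flags the truly active features should receive most of the weight, which mirrors the intent of emphasizing reliable sources and downweighting unreliable ones. The softmax scoring is only one of many possible choices and stands in for whatever adaptive weighting the actual method uses.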