Sparse high-dimensional signal recovery is only possible under certain conditions on the number of parameters, sample size, signal strength and underlying sparsity. We show that leveraging external information, as is possible with data integration or transfer learning, makes it possible to push these mathematical limits. Specifically, we consider external information that allows parameters to be split into blocks, first in a simplified case, the Gaussian sequence model, and then in the general linear regression setting. We show that block-based $\ell_0$ penalties that depend on the external information attain model selection consistency under milder conditions than standard $\ell_0$ penalties, and that they also attain faster model recovery rates. We first provide results for oracle-based $\ell_0$ penalties that have access to perfect sparsity and signal strength information. Subsequently, we propose an empirical Bayes data analysis method that does not require oracle information and for which efficient computation is possible via standard MCMC techniques. Our results provide a mathematical basis to justify the use of data integration methods in high-dimensional structural learning.
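To make the idea of a block-based $\ell_0$ penalty concrete, the following minimal sketch treats the Gaussian sequence model $y_i = \theta_i + \varepsilon_i$, where the $\ell_0$-penalized least-squares criterion separates across coordinates and reduces to hard thresholding. It is an illustration under stated assumptions, not the oracle or empirical Bayes procedure of the paper: the function names `l0_select` and `block_penalties`, the toy data, and the penalty values are all hypothetical.

```python
def l0_select(y, penalties):
    """Support of the minimizer of sum_i (y_i - theta_i)^2 + penalties[i] * 1{theta_i != 0}.

    The objective separates across coordinates, so theta_i is set to y_i
    (nonzero) exactly when y_i^2 exceeds its penalty, and to 0 otherwise.
    """
    return [yi * yi > lam for yi, lam in zip(y, penalties)]


def block_penalties(blocks, penalty_by_block):
    """Expand one penalty per block into one penalty per coordinate."""
    return [penalty_by_block[b] for b in blocks]


# Toy data (hypothetical): four observations split into two blocks.
y = [2.5, 0.3, -2.5, 0.1]
blocks = [0, 0, 1, 1]

# Standard l0 penalty: a single value shared by all coordinates.
standard = [9.0] * len(y)

# Block-based penalty: lighter in block 0, heavier in block 1.
# These values are illustrative assumptions, not the paper's choices.
blockwise = block_penalties(blocks, {0: 4.0, 1: 16.0})

print(l0_select(y, standard))   # [False, False, False, False]
print(l0_select(y, blockwise))  # [True, False, False, False]
```

In this toy example the common penalty is too large to select any coordinate, whereas the lighter penalty in block 0 lets the first observation through and the heavier penalty in block 1 still screens out an observation of the same magnitude. This is the mechanism by which block-level information can change what is recoverable; how the penalties should be set from sparsity and signal strength is the subject of the results summarized above.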