High-dimensional data in modern applications, such as COVID-19 mortality data, often span multiple domains. Transfer learning uses auxiliary information from source domains to improve performance in a target domain. However, data contamination, a practical issue that has been largely overlooked, induces heterogeneity and can significantly degrade transfer learning performance. To address this challenge, we propose an approach to transfer learning under data contamination in a structured regression setting. Building on the robust L2E criterion, we develop TransL2E, a method that accounts for contamination in both target and source data while still transferring relevant information. Beyond robust estimation, TransL2E introduces a data-driven bi-level source detection mechanism that operates at both the individual and cohort levels and offers several advantages over existing source detection approaches. Comprehensive simulation studies and a real-data application demonstrate the superior performance of TransL2E in both robust estimation and structure recovery when data are limited and contaminated.
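The abstract names the L2E criterion without stating it. As a rough illustration of why it is considered robust, the sketch below evaluates the standard L2E (integrated squared error) loss for regression residuals under an assumed Gaussian error model, next to ordinary mean squared error. This is not the TransL2E objective, which additionally involves structured regularization and source detection that the abstract does not detail. The function names `gaussian_pdf`, `l2e_loss`, and `mse_loss` and the toy residuals are hypothetical, chosen only for this example.

```python
import math


def gaussian_pdf(r, sigma):
    """Density of N(0, sigma^2) evaluated at residual r."""
    z = r / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)


def l2e_loss(residuals, sigma):
    """Standard L2E criterion for Gaussian errors (illustrative only).

    L2E(sigma) = 1 / (2 * sqrt(pi) * sigma)
                 - (2 / n) * sum_i phi(r_i; 0, sigma^2)
    Each residual enters through a bounded density value, so a gross
    outlier contributes almost nothing instead of dominating the loss.
    """
    n = len(residuals)
    integral_term = 1.0 / (2.0 * math.sqrt(math.pi) * sigma)
    fit_term = (2.0 / n) * sum(gaussian_pdf(r, sigma) for r in residuals)
    return integral_term - fit_term


def mse_loss(residuals):
    """Ordinary mean squared error, shown for comparison."""
    return sum(r * r for r in residuals) / len(residuals)


# Hypothetical residuals: four clean points, then one contaminated point.
clean = [0.1, -0.2, 0.05, 0.0]
contaminated = clean + [50.0]

print("MSE clean / contaminated:", mse_loss(clean), mse_loss(contaminated))
print("L2E clean / contaminated:",
      l2e_loss(clean, 1.0), l2e_loss(contaminated, 1.0))
```

In this toy comparison the single contaminated residual inflates the mean squared error by several orders of magnitude, while the L2E value changes only through the 1/n reweighting of the clean points. In an actual estimation procedure the loss would be minimized jointly over the regression coefficients and the scale `sigma`; that step is omitted here.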