We consider Heterogeneous Transfer Learning (HTL) from a source to a new target domain for high-dimensional regression with differing feature sets. Most homogeneous TL methods assume that the target and source domains share the same feature space, which limits their practical applicability. In applications, the target and source features frequently differ because certain variables cannot be measured in data-poor target environments. Meanwhile, existing HTL methods do not provide statistical error guarantees, limiting their utility for scientific discovery. Our method first learns a feature map between the missing and observed features, leveraging the abundant source data, and then imputes the missing features in the target. Using the combined matched and imputed features, we then perform a two-step transfer learning procedure for penalized regression. We develop upper bounds on estimation and prediction errors, assuming that the source and target parameters differ sparsely but without assuming sparsity in the target model. We obtain results both when the feature map is linear and when it is specified nonparametrically as unknown functions. Our results elucidate how the estimation and prediction errors of HTL depend on the model's complexity, the sample sizes, the quality of and differences in the feature maps, and the differences in the models across domains.
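The two-stage procedure described above (learn a feature map on the source, impute the target's missing features, then transfer with a sparse source-target contrast) can be sketched on synthetic data as follows. This is a minimal illustration, not the paper's estimator: the feature map is assumed linear and fit by least squares, the penalized fits are stand-ins (ridge plus soft-thresholding in place of an l1 solver), and all variable names (`A_hat`, `beta_s_hat`, `delta_hat`, the dimensions, and the tuning constants) are illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: p features observed in both domains,
# q features measured only in the (data-rich) source domain.
n_s, n_t, p, q = 500, 50, 10, 5

# Source coefficients over the full feature set [observed, missing];
# the target coefficients differ from them only sparsely.
beta_full = rng.normal(size=p + q)
delta = np.zeros(p + q)
delta[0] = 0.5
beta_t = beta_full + delta

# Linear feature map: missing features are generated from observed ones.
A_true = 0.3 * rng.normal(size=(p, q))

X_obs_s = rng.normal(size=(n_s, p))
X_mis_s = X_obs_s @ A_true + 0.1 * rng.normal(size=(n_s, q))
X_s = np.hstack([X_obs_s, X_mis_s])
y_s = X_s @ beta_full + 0.1 * rng.normal(size=n_s)

X_obs_t = rng.normal(size=(n_t, p))
X_mis_t = X_obs_t @ A_true + 0.1 * rng.normal(size=(n_t, q))  # never seen by the method
y_t = np.hstack([X_obs_t, X_mis_t]) @ beta_t + 0.1 * rng.normal(size=n_t)

def ridge(X, y, lam):
    """Closed-form ridge fit, standing in for a generic penalized regression."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Stage 1: learn the feature map on the source, impute missing target features.
A_hat = np.linalg.lstsq(X_obs_s, X_mis_s, rcond=None)[0]
X_t_imp = np.hstack([X_obs_t, X_obs_t @ A_hat])

# Stage 2a: estimate the source parameter on the abundant source data.
beta_s_hat = ridge(X_s, y_s, lam=1.0)

# Stage 2b: estimate the sparse source-target contrast from target residuals
# (soft-thresholding a ridge fit crudely mimics an l1-penalized correction).
resid = y_t - X_t_imp @ beta_s_hat
delta_raw = ridge(X_t_imp, resid, lam=5.0)
delta_hat = np.sign(delta_raw) * np.maximum(np.abs(delta_raw) - 0.05, 0.0)

beta_t_hat = beta_s_hat + delta_hat  # final target estimate
```

The sketch makes the abstract's assumptions concrete: the target model itself (`beta_t`) need not be sparse; only the contrast `delta` is, and the quality of `A_hat` directly controls how well the imputed features substitute for the unmeasured ones.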