Transfer learning plays a key role in modern data analysis when (1) the target data are scarce but the source data are sufficient, and (2) the distributions of the source and target data are heterogeneous. This paper develops an interpretable unified transfer learning model, termed UTrans, which can detect both transferable variables and transferable source data. More specifically, we establish estimation error bounds and prove that they are lower than the bounds obtained from target data alone. In addition, we propose a source detection algorithm based on hypothesis testing to exclude nontransferable data. We evaluate UTrans against existing algorithms in multiple experiments. The results show that UTrans attains much lower estimation and prediction errors than the existing methods while preserving interpretability. Finally, we apply UTrans to US intergenerational mobility data and compare the proposed algorithms with classical machine learning algorithms.
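The setting described above can be illustrated with a small simulation. The sketch below is not the UTrans estimator or its hypothesis-testing procedure; it swaps in naive pooled ordinary least squares to show why a transferable source helps when target data are scarce, and why a nontransferable source should be excluded. All names, sample sizes, and coefficient values are assumptions chosen for illustration.

```python
import numpy as np

# Illustrative assumptions: dimension, sample sizes, and coefficients are hypothetical.
rng = np.random.default_rng(0)
p, n_target, n_source = 20, 40, 500

beta = np.zeros(p)
beta[:5] = 1.0  # sparse target coefficient vector


def simulate(n, coef):
    """Draw n observations from a linear model with standard normal noise."""
    X = rng.standard_normal((n, p))
    y = X @ coef + rng.standard_normal(n)
    return X, y


def ols(X, y):
    """Ordinary least squares via numpy's least-squares solver."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def sq_error(estimate):
    """Squared L2 estimation error against the target coefficients."""
    return float(np.sum((estimate - beta) ** 2))


# Scarce target data.
X_t, y_t = simulate(n_target, beta)

# Transferable source: coefficients differ from the target by a small contrast.
delta = np.zeros(p)
delta[:3] = 0.05
X_good, y_good = simulate(n_source, beta + delta)

# Nontransferable source: coefficients are shifted by a large contrast.
X_bad, y_bad = simulate(n_source, beta + 2.0)

err_target = sq_error(ols(X_t, y_t))
err_good = sq_error(
    ols(np.vstack([X_t, X_good]), np.concatenate([y_t, y_good]))
)
err_bad = sq_error(
    ols(np.vstack([X_t, X_bad]), np.concatenate([y_t, y_bad]))
)

print(f"target only:           {err_target:.3f}")
print(f"target + good source:  {err_good:.3f}")
print(f"target + bad source:   {err_bad:.3f}")
```

Under these assumptions, pooling with the similar source should reduce the estimation error relative to the target-only fit, while pooling with the dissimilar source should increase it. This is the motivation for detecting which sources are transferable before combining them.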