Multi-source domain adaptation (DA) aims to leverage information from more than one source domain to make predictions in a target domain, where the domains may have different data distributions. Most existing multi-source DA methods focus on classification, and regression settings have received only limited investigation. In this paper, we fill this gap through a two-step procedure. First, we extend a flexible single-source DA algorithm for classification to regression problems through outcome-coarsening. We then augment this single-source DA algorithm for regression with ensemble learning to achieve multi-source DA. The ensemble linearly combines the target-adapted learners trained on each source domain, and we consider three learning paradigms for choosing its weights: (i) a multi-source stacking algorithm; (ii) a similarity-based weighting in which the weights reflect the quality of DA of each target-adapted learner; and (iii) a combination of the stacking and similarity weights. We illustrate the performance of our algorithms with simulations and a data application in which the goal is to predict high-density lipoprotein (HDL) cholesterol levels from the gut microbiome. In all of these scenarios, our multi-source DA algorithm consistently improves prediction performance over routinely used methods.
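To make the two steps concrete, the following is a minimal Python sketch of outcome-coarsening and of the three weighting paradigms for a linear ensemble. It is an illustration under stated assumptions, not the algorithm of the paper: the abstract does not specify the binning rule, the stacking objective, the similarity measure, or how the two sets of weights are combined. Quantile binning, clipped least squares, normalised similarity scores, and an elementwise product are all choices made here for illustration, and every function name is hypothetical.

```python
import numpy as np


def coarsen_outcome(y, n_bins=4):
    """Map a continuous outcome to ordinal class labels 0..n_bins-1.

    Assumption: cut points are empirical quantiles of y.
    """
    y = np.asarray(y, dtype=float)
    edges = np.quantile(y, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    return np.digitize(y, edges)


def stacking_weights(preds, y):
    """Ensemble weights from regressing y on the per-source predictions.

    preds has shape (n_samples, n_sources). Assumption: ordinary least
    squares, then clipping at zero and renormalising to sum to one.
    """
    w, *_ = np.linalg.lstsq(preds, y, rcond=None)
    w = np.clip(w, 0.0, None)
    if w.sum() == 0.0:
        return np.full(preds.shape[1], 1.0 / preds.shape[1])
    return w / w.sum()


def similarity_weights(scores):
    """Normalise non-negative similarity scores (one per source) to sum to one.

    Assumption: a larger score means the source was adapted to the target better.
    """
    s = np.asarray(scores, dtype=float)
    return s / s.sum()


def combined_weights(w_stack, w_sim):
    """Combine the two weight vectors by a renormalised elementwise product.

    Assumption: the product rule is only one of several possible combinations.
    """
    w = np.asarray(w_stack) * np.asarray(w_sim)
    return w / w.sum() if w.sum() > 0.0 else np.asarray(w_sim)


def ensemble_predict(preds, weights):
    """Linear combination of the target-adapted learners' predictions."""
    return np.asarray(preds) @ np.asarray(weights)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical predictions from two target-adapted learners.
    preds = rng.normal(size=(50, 2))
    y = preds @ np.array([0.7, 0.3])
    w = combined_weights(stacking_weights(preds, y), similarity_weights([1.0, 3.0]))
    print(w, ensemble_predict(preds, w)[:3])
```

In this sketch, each column of `preds` stands for the predictions of one learner trained on a single source domain and adapted to the target. How those predictions and the similarity scores are obtained is outside what the abstract describes.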