Multi-source domain adaptation aims to reduce the performance degradation that occurs when machine learning models are applied to unseen domains. A fundamental challenge is devising the optimal strategy for feature selection. The existing literature is somewhat paradoxical: some works advocate learning invariant features from source domains, while others favor more diverse features. To address this challenge, we propose a statistical framework that distinguishes the utilities of features based on the variance of their correlation with the label $y$ across domains. Under our framework, we design and analyze a learning procedure that consists of learning an approximately shared feature representation from source tasks and fine-tuning it on the target task. Our theoretical analysis demonstrates the importance of learning approximately shared features instead of only the strictly invariant features and yields an improved population risk compared to previous results on both source and target tasks, thus partly resolving the paradox mentioned above. Inspired by our theory, we propose a more practical way to isolate the content features (invariant + approximately shared) from the environmental features, which further validates our theoretical findings.
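The idea of ranking features by how much their correlation with $y$ varies across domains can be illustrated with a small sketch. This is not the paper's procedure: the use of Pearson correlation, the function names `correlation_variance` and `categorize`, the thresholds `low` and `high`, and the synthetic data are all assumptions made purely for illustration.

```python
import numpy as np

def correlation_variance(domains):
    """Variance, across domains, of each feature's Pearson correlation with y.

    domains: list of (X, y) pairs, X of shape (n, d), y of shape (n,).
    """
    corrs = []
    for X, y in domains:
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
        corrs.append(Xc.T @ yc / denom)
    return np.var(np.stack(corrs), axis=0)

def categorize(var, low=1e-3, high=5e-2):
    """Bucket features by correlation variance (thresholds are hypothetical)."""
    return np.where(
        var <= low,
        "invariant",
        np.where(var <= high, "approximately shared", "environmental"),
    )

# Synthetic source domains (illustrative only): each feature is coef * y + noise.
rng = np.random.default_rng(0)
n = 5000
# Per-domain coefficients for (feature 0, feature 1, feature 2).
coefs = [
    (1.0, 0.8, -1.5),
    (1.0, 1.0, -0.5),
    (1.0, 1.2, 0.5),
    (1.0, 1.4, 1.5),
]
domains = []
for c in coefs:
    y = rng.standard_normal(n)
    X = np.outer(y, np.array(c)) + rng.standard_normal((n, 3))
    domains.append((X, y))

variances = correlation_variance(domains)
labels = categorize(variances)
for j, (v, lab) in enumerate(zip(variances, labels)):
    print(f"feature {j}: variance of correlation = {v:.4f} -> {lab}")
```

In this toy setup, feature 0 has the same relationship to $y$ in every domain, feature 1 drifts mildly, and feature 2 flips sign, so the three land in different buckets. In the terminology of the abstract, the first two buckets together would play the role of content features, while the last corresponds to environmental features.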