Identifiable Latent Causal Content for Domain Adaptation under Latent Covariate Shift

Multi-source domain adaptation (MSDA) aims to learn a label predictor for an unlabeled target domain by leveraging labeled data from multiple source domains together with unlabeled data from the target domain. Conventional MSDA approaches often rely on the covariate shift or conditional shift paradigms, which assume that the label distribution is the same across domains. This assumption is restrictive in practice, where label distributions often vary across domains. For example, animals from different regions exhibit diverse characteristics due to varying diets and genetics. Motivated by this, we propose a novel paradigm called latent covariate shift (LCS), which permits substantially greater variability across domains. Notably, it provides a theoretical guarantee for recovering the latent cause of the label variable, which we refer to as the latent content variable. Within this paradigm, we present a causal generative model that introduces latent noises across domains, along with a latent content variable and a latent style variable, to describe the observed data in finer detail. We show that the latent content variable can be identified up to block identifiability, owing to its versatile yet distinct causal structure. We translate these theoretical insights into a novel MSDA method that learns the label distribution conditioned on the identifiable latent content variable, thereby accommodating more substantial distribution shifts. The proposed approach performs strongly on both simulated and real-world datasets.
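To make the latent covariate shift idea concrete, the following is a minimal toy simulation, not the model or method of the paper. It assumes a one-dimensional latent content variable `z_c` whose distribution depends on the domain, a label mechanism p(y | z_c) shared by all domains, and a style variable `z_s` driven only by domain-specific noise. Every functional form, parameter value, and name here (`sample_domain`, `mean_c`, `scale_s`, the sigmoid label mechanism, the mixing map) is a hypothetical choice for illustration; the abstract does not specify them.

```python
import numpy as np

def sample_domain(mean_c, scale_s, n, rng):
    """Draw (x, y, z_c) from one domain of a toy latent-covariate-shift model.

    All functional forms below are illustrative assumptions.
    """
    # Domain-specific latent noises: only their distributions change with the domain.
    n_c = rng.normal(loc=mean_c, scale=1.0, size=n)
    n_s = rng.normal(loc=0.0, scale=scale_s, size=n)
    z_c = n_c  # latent content (assumed identity map from its noise)
    z_s = n_s  # latent style (assumed identity map from its noise)
    # Label mechanism p(y | z_c): identical in every domain.
    p_y = 1.0 / (1.0 + np.exp(-2.0 * z_c))
    y = (rng.random(n) < p_y).astype(int)
    # Observation mixes content and style through a fixed nonlinear map.
    x = np.stack([np.tanh(z_c + z_s), z_c - z_s], axis=1)
    return x, y, z_c

def local_label_rate(y, z_c, lo=-0.25, hi=0.25):
    """Empirical P(y = 1) among samples whose content lies in [lo, hi]."""
    mask = (z_c >= lo) & (z_c <= hi)
    return float(y[mask].mean())

rng = np.random.default_rng(0)
settings = {"A": (-1.0, 0.5), "B": (1.0, 1.5)}  # (mean_c, scale_s) per domain
domains = {u: sample_domain(m, s, 20000, rng) for u, (m, s) in settings.items()}

# Marginal label distribution differs across domains ...
label_rates = {u: float(y.mean()) for u, (_, y, _) in domains.items()}
# ... while the label distribution given the content variable stays the same.
cond_rates = {u: local_label_rate(y, z) for u, (_, y, z) in domains.items()}

print("P(y=1) per domain:", label_rates)
print("P(y=1 | z_c near 0) per domain:", cond_rates)
```

In this sketch the marginal label rates of the two domains differ markedly, while the label rate conditioned on similar content values is approximately the same. This is the property that motivates predicting the label from the content variable rather than directly from the observations.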
