We study an unsupervised domain adaptation problem in which the source domain consists of subpopulations defined by a binary label $Y$ and a binary background (or environment) $A$. We focus on a challenging setting in which one of these subpopulations is unobservable in the source domain. Naively ignoring the unobserved group can lead to biased estimates and degraded predictive performance. Despite this structured missingness, we show that prediction in the target domain can still be recovered. Specifically, we rigorously derive both background-specific and overall prediction models for the target domain. For practical implementation, we propose a distribution matching method to estimate the subpopulation proportions. We provide theoretical guarantees for the asymptotic behavior of our estimator and establish an upper bound on the prediction error. Experiments on both synthetic and real-world datasets show that our method outperforms a naive benchmark that does not account for the unobservable source subpopulation.