Point-Identification of a Robust Predictor Under Latent Shift with Imperfect Proxies

Domain adaptation becomes more challenging when distribution shifts across domains stem from latent confounders that affect both covariates and outcomes. Existing proxy-based approaches to latent shift rely on a strong completeness assumption to uniquely determine (point-identify) a robust predictor. Completeness requires that proxies carry sufficient information about variations in the latent confounders. For imperfect proxies, the mapping from confounders to the space of proxy distributions is non-injective, so multiple latent confounder values can generate the same proxy distribution. This breaks the completeness assumption, and the observed data are then consistent with multiple potential predictors, so the robust predictor is only set-identified. To address this, we introduce latent equivalent classes (LECs), defined as groups of latent confounders that induce the same conditional proxy distribution. We show that point-identification of the robust predictor remains achievable as long as multiple domains differ sufficiently in how they mix proxy-induced LECs to form the robust predictor. This domain diversity condition is formalized as a cross-domain rank condition on the mixture weights, which is a substantially weaker assumption than completeness. We introduce the Proximal Quasi-Bayesian Active Learning (PQAL) framework, which actively queries a small, targeted set of diverse domains that satisfy this rank condition. PQAL recovers the point-identified predictor, is robust to varying degrees of shift, and outperforms previous methods on synthetic data and on the semi-synthetic dSprites, IHDP, and ACS Folktables datasets.
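To make the cross-domain rank condition concrete, the following is a minimal toy sketch, not the PQAL estimator. It assumes a finite number K of LECs, a D x K matrix of per-domain mixture weights, and that each domain's predictor is a linear mixture of K per-LEC components. These assumptions, and all names in the code (`lec_components_identified`, `recover_components`, `W_diverse`, `W_similar`, `F_true`), are hypothetical and chosen only for illustration.

```python
import numpy as np


def lec_components_identified(weights, tol=1e-8):
    """Check a toy version of the rank condition.

    weights: (D, K) array; row d holds domain d's mixture weights over K LECs.
    Returns True when the matrix has full column rank K, i.e. the domains
    mix the LECs in sufficiently different ways.
    """
    weights = np.asarray(weights, dtype=float)
    return int(np.linalg.matrix_rank(weights, tol=tol)) == weights.shape[1]


def recover_components(weights, domain_predictions):
    """Solve weights @ F = domain_predictions for F by least squares.

    domain_predictions: (D, n) array of per-domain predictions at n inputs.
    Returns F with shape (K, n): one row per LEC component.
    """
    F, *_ = np.linalg.lstsq(weights, domain_predictions, rcond=None)
    return F


# Hypothetical per-LEC components evaluated at three inputs (K = 3).
F_true = np.array([[0.0, 1.0, 2.0],
                   [1.0, 1.0, 1.0],
                   [3.0, 0.0, -1.0]])

# Four domains that mix the LECs differently: rank 3, condition holds.
W_diverse = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.2, 0.2, 0.6],
                      [0.4, 0.3, 0.3]])

# Four domains with identical mixing: rank 1, condition fails.
W_similar = np.tile([0.5, 0.3, 0.2], (4, 1))

print(lec_components_identified(W_diverse))   # True
print(lec_components_identified(W_similar))   # False

# With diverse domains, the per-LEC components are recovered exactly.
F_hat = recover_components(W_diverse, W_diverse @ F_true)
print(np.allclose(F_hat, F_true))             # True
```

In this toy setting, identical mixing across domains leaves the per-LEC components undetermined (many F reproduce the same domain-level predictions), which mirrors the set-identified case; diverse mixing pins them down uniquely.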
