Consider the regression problem where the response $Y\in\mathbb{R}$ and the covariate $X\in\mathbb{R}^d$, $d\geq 1$, are \textit{unmatched}. In this setting, we do not observe pairs drawn from the joint distribution of $(X, Y)$; instead, we have separate datasets $\{Y_i\}_{i=1}^n$ and $\{X_j\}_{j=1}^m$, possibly collected from different sources. We study this problem assuming that the regression function is linear and that the noise distribution is known or can be estimated. We introduce an estimator of the regression vector based on deconvolution and establish its consistency and asymptotic normality under an identifiability assumption. In the general case, we show that our estimator (DLSE: Deconvolution Least Squared Estimator) is consistent in terms of an extended $\ell_2$ norm. Building on this result, we devise a method for semi-supervised learning, i.e., for the setting in which a small sample of matched pairs $(X_k, Y_k)$ is also available. Several applications with synthetic and real datasets illustrate the theory.
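To make the deconvolution idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not state the criterion, so the code assumes one natural form. Under the linear model $Y = \beta^\top X + \varepsilon$ with known noise CDF $F_\varepsilon$, the CDF of $Y$ is the average of $F_\varepsilon(y - \beta^\top X)$ over the distribution of $X$, so the sketch picks the $\beta$ whose implied CDF is closest, in mean squared distance over the observed $Y_i$, to the empirical CDF of the $Y$ sample. The function names, the logistic noise, the exponential covariate, and the grid search are all illustrative assumptions.

```python
import numpy as np


def logistic_cdf(t, scale=0.5):
    # CDF of a centered logistic distribution, written with tanh for numerical stability.
    # Assumption: the noise law is known and logistic; any known CDF could be passed instead.
    return 0.5 * (1.0 + np.tanh(t / (2.0 * scale)))


def deconvolution_criterion(beta, X, Y, noise_cdf):
    # Mean squared gap between the empirical CDF of Y and the CDF implied by beta,
    # i.e. the convolution of the empirical law of beta^T X with the noise law.
    index = X @ beta                                   # shape (m,)
    y_sorted = np.sort(Y)                              # shape (n,)
    n = y_sorted.size
    ecdf_y = np.arange(1, n + 1) / n                   # empirical CDF at the sorted Y values
    model_cdf = noise_cdf(y_sorted[:, None] - index[None, :]).mean(axis=1)
    return float(np.mean((ecdf_y - model_cdf) ** 2))


def fit_on_grid(X, Y, noise_cdf, candidates):
    # Hypothetical helper: brute-force minimization over a finite set of candidate vectors.
    values = [deconvolution_criterion(b, X, Y, noise_cdf) for b in candidates]
    return candidates[int(np.argmin(values))]


# Synthetic unmatched data: the X sample and the Y sample come from independent draws.
rng = np.random.default_rng(0)
true_beta = 2.0
X_sample = rng.exponential(1.0, size=(400, 1))         # observed covariates
X_hidden = rng.exponential(1.0, size=(400, 1))         # covariates behind Y, never observed
Y_sample = true_beta * X_hidden[:, 0] + rng.logistic(scale=0.5, size=400)

grid = np.linspace(-4.0, 4.0, 81)[:, None]             # candidate beta vectors, each of shape (1,)
beta_hat = fit_on_grid(X_sample, Y_sample, logistic_cdf, grid)
print("estimated beta:", beta_hat[0])
```

The covariate is drawn from an asymmetric (exponential) distribution on purpose: with a covariate law symmetric about zero, $\beta$ and $-\beta$ would imply the same distribution for $Y$, which is the kind of ambiguity an identifiability assumption rules out. For $d > 1$ the grid search would be replaced by a numerical optimizer.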