In semi-supervised learning, the available data usually consist of a relatively small labeled sample and a much larger unlabeled sample, and the key issue is how to exploit the unlabeled data effectively. In this paper, we write the regression function in terms of a copula and the marginal distributions, so that the unlabeled data can be used to improve the estimation of the marginal distributions. The predictions based on different copulas are combined by weighted averaging, where the weights are obtained by minimizing an asymptotically unbiased estimator of the prediction risk. An error-ambiguity decomposition of the prediction risk is performed so that the unlabeled data can also be used to improve the estimation of the prediction risk. We establish the asymptotic normality of the copula parameter estimators and the regression function estimators of the candidate models under the semi-supervised framework, as well as the asymptotic optimality and weight consistency of the model averaging estimator. Our model averaging estimator achieves faster convergence rates for asymptotic optimality and weight consistency than its supervised counterpart. Extensive simulation experiments and an application to the California housing dataset demonstrate the effectiveness of the proposed method.
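For orientation, the copula form of the regression function mentioned above can be sketched as follows. The notation is an assumption made for illustration and is not taken from the paper: $Y$ is the response with marginal distribution function $F_Y$, $X = (X_1, \dots, X_p)$ are the covariates with marginal distribution functions $F_1, \dots, F_p$, $c$ is the copula density of $(Y, X)$, and $c_X$ is the copula density of $X$. The weights $w_k$ and the $K$ candidate copulas in the last line are likewise hypothetical labels, and any constraint on the weights (for example, non-negative and summing to one) would be an additional assumption.

```latex
% Conditional mean written through the copula density and the marginals
m(x) = \mathbb{E}[Y \mid X = x]
     = \int y \,
       \frac{c\bigl(F_Y(y), F_1(x_1), \dots, F_p(x_p)\bigr)}
            {c_X\bigl(F_1(x_1), \dots, F_p(x_p)\bigr)}
       \, \mathrm{d}F_Y(y),
\qquad
c_X(u_1, \dots, u_p) = \int_0^1 c(u_0, u_1, \dots, u_p) \, \mathrm{d}u_0 .

% Weighted combination of the predictions from K candidate copulas
\hat{m}_w(x) = \sum_{k=1}^{K} w_k \, \hat{m}_k(x) .
```

In this form the covariate marginals $F_1, \dots, F_p$ involve no labels, which is why they can be estimated from the labeled and unlabeled covariates together, while $F_Y$ and the copula parameters still require labeled observations.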