Semi-supervised learning (SSL) aims to train a machine learning model using both labelled and unlabelled data. While unlabelled data have been used in various ways to improve prediction accuracy, why they help is not fully understood. One promising direction is to understand SSL from a causal perspective. In light of the independent causal mechanisms principle, unlabelled data can be helpful when the label causes the features but not vice versa. However, the causal relations between features and labels can be complex in real-world applications. In this paper, we propose an SSL framework that works with general causal models in which the variables have flexible causal relations. More specifically, we explore the causal graph structures and design corresponding causal generative models that can be learned with the help of unlabelled data. The learned causal generative model can then generate synthetic labelled data for training a more accurate predictive model. We verify the effectiveness of the proposed method through empirical studies on both simulated and real data.