Semi-supervised learning is a model training method that uses both labeled and unlabeled data. This paper proposes a fully Bayesian semi-supervised learning algorithm that can be applied to any binary classification problem. We assume that the labels of the unlabeled observations are missing at random. We further assume that, after a common unknown transformation is applied to each component of the observation vector, the observations follow one of two multivariate normal distributions, depending on their true class labels. The transformation is expanded in a B-spline series, and a prior is put on the coefficients. We use a normal prior on the coefficients and constrain their values to satisfy the normality requirement and the identifiability constraints. The precision matrices of the two Gaussian distributions have conjugate Wishart priors, while the means have improper uniform priors. The resulting posterior is still conditionally conjugate, so a Gibbs sampler aided by a data augmentation technique can be adopted. An extensive simulation study compares the proposed method with several other available methods. The proposed method is also applied to real datasets on breast cancer diagnosis and signal classification. We conclude that the proposed method achieves better prediction accuracy than the competing methods in a variety of cases.
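The sampling scheme described above can be illustrated with a minimal sketch. It is not the paper's implementation: it omits the B-spline transformation entirely (the observations are assumed to be already on the Gaussian scale), it places a Beta(1, 1) prior on the class weight, which is an assumption not stated in the abstract, and it draws each precision matrix with the mean integrated out before drawing the mean. The function name `gibbs_semisupervised` and the hyperparameters `nu0` and `V0_inv` are hypothetical.

```python
import numpy as np
from scipy.special import expit
from scipy.stats import multivariate_normal, wishart


def gibbs_semisupervised(X, y, n_iter=300, burn_in=100, seed=0):
    """Return the posterior probability of class 1 for each unlabeled row.

    X: (n, p) array of observations, assumed already on the Gaussian scale.
    y: (n,) integer array with 0/1 for labeled rows and -1 for unlabeled rows;
       each class needs at least one labeled row.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    unl = y < 0
    Xu = X[unl]
    nu0, V0_inv = p + 2, np.eye(p)  # hypothetical Wishart hyperparameters

    # Initialise the missing labels by the nearest labeled class mean.
    centers = np.stack([X[y == k].mean(axis=0) for k in (0, 1)])
    dist = ((Xu[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    z = y.copy()
    z[unl] = dist.argmin(axis=1)

    prob_sum = np.zeros(unl.sum())
    for it in range(n_iter):
        # Class weight: Beta(1, 1) prior (an assumption) updated by label counts.
        n1 = int((z == 1).sum())
        pi1 = rng.beta(1 + n1, 1 + n - n1)

        log_lik = []
        for k in (0, 1):
            Xk = X[z == k]
            nk, xbar = len(Xk), Xk.mean(axis=0)
            S = (Xk - xbar).T @ (Xk - xbar)
            # Precision draw with the mean integrated out under its flat prior.
            scale = np.linalg.inv(V0_inv + S)
            omega = wishart.rvs(df=nu0 + nk - 1, scale=(scale + scale.T) / 2,
                                random_state=rng)
            # Mean given the precision: N(xbar, (nk * omega)^-1).
            cov = np.linalg.inv(np.atleast_2d(omega))
            cov = (cov + cov.T) / 2
            mu = rng.multivariate_normal(xbar, cov / nk)
            log_lik.append(multivariate_normal.logpdf(Xu, mean=mu, cov=cov))

        # Data augmentation: redraw each missing label from its conditional.
        log_odds = np.log(pi1) - np.log1p(-pi1) + log_lik[1] - log_lik[0]
        prob1 = expit(log_odds)
        z[unl] = rng.random(prob1.shape) < prob1
        if it >= burn_in:
            prob_sum += prob1
    return prob_sum / (n_iter - burn_in)
```

Thresholding the returned probabilities at 0.5 gives a class prediction for each unlabeled observation. Because the labeled rows stay fixed throughout, they anchor the two components and prevent label switching between classes.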