Semi-supervised learning is a model training method that uses both labeled and unlabeled data. This paper proposes a fully Bayesian semi-supervised learning algorithm that can be applied to any multi-category classification problem. When using unlabeled data in the semi-supervised setting, we assume the labels are missing at random. Suppose the data contain $K$ classes. We assume that, after a common unknown transformation is applied to each component of the observation vector, the observations follow one of $K$ multivariate normal distributions determined by their true class labels. The transformation function is expanded in a B-spline series, and a prior is placed on the coefficients. We use a normal prior on the coefficients, with their values constrained to satisfy the normality and identifiability requirements. The precision matrices of the Gaussian distributions are given a conjugate Wishart prior, while the means are given an improper uniform prior. The resulting posterior is still conditionally conjugate, so a Gibbs sampler aided by a data-augmentation technique can be used. An extensive simulation study compares the proposed method with several other available methods. The proposed method is also applied to real datasets on breast cancer diagnosis and signal classification. We conclude that the proposed method has better prediction accuracy in various cases.
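To make the sampling scheme in the abstract more concrete, the following is a minimal sketch of one Gibbs sweep with data augmentation for the Gaussian part of the model only. It is not the authors' implementation. Several simplifications are assumptions made for illustration: the B-spline transformation step is omitted (the data are treated as already transformed), the class proportions are given a Dirichlet(1, ..., 1) prior that the abstract does not mention, the Wishart degrees of freedom are restricted to integers, every class is assumed to keep at least one labeled observation, and the function names (`sample_wishart`, `symmetric_inverse`, `gibbs_sweep`, `run_demo`) and the hyperparameters `nu0 = 3`, `V0 = I` are hypothetical.

```python
import numpy as np


def sample_wishart(rng, df, scale):
    """Draw from Wishart(df, scale), integer df >= dim, as a sum of outer products."""
    x = rng.multivariate_normal(np.zeros(scale.shape[0]), scale, size=df)
    return x.T @ x


def symmetric_inverse(a):
    """Invert a symmetric positive-definite matrix and remove round-off asymmetry."""
    inv = np.linalg.inv(a)
    return (inv + inv.T) / 2.0


def gibbs_sweep(rng, y, z, is_labeled, n_classes, nu0, v0_inv):
    """One sweep: draw (pi, mu_k, Omega_k) given labels, then redraw missing labels."""
    counts = np.bincount(z, minlength=n_classes)
    # Assumed Dirichlet(1, ..., 1) prior on class proportions (not from the source).
    pi = rng.dirichlet(1.0 + counts)
    log_post = np.empty((y.shape[0], n_classes))
    for k in range(n_classes):
        yk = y[z == k]
        ybar = yk.mean(axis=0)
        scatter = (yk - ybar).T @ (yk - ybar)
        # Flat prior on mu_k and Wishart(nu0, V0) prior on Omega_k give
        # Omega_k | z, y ~ Wishart(nu0 + n_k - 1, (V0^{-1} + S_k)^{-1}).
        omega = sample_wishart(
            rng, nu0 + len(yk) - 1, symmetric_inverse(v0_inv + scatter)
        )
        # mu_k | Omega_k, z, y ~ N(ybar_k, (n_k Omega_k)^{-1}).
        mu = rng.multivariate_normal(ybar, symmetric_inverse(len(yk) * omega))
        # Unnormalized log posterior probability of class k for every observation.
        resid = y - mu
        quad = np.einsum("ij,jk,ik->i", resid, omega, resid)
        log_post[:, k] = (
            np.log(pi[k]) + 0.5 * np.linalg.slogdet(omega)[1] - 0.5 * quad
        )
    prob = np.exp(log_post - log_post.max(axis=1, keepdims=True))
    prob /= prob.sum(axis=1, keepdims=True)
    # Data augmentation: only the missing labels are redrawn.
    z_new = z.copy()
    for i in np.flatnonzero(~is_labeled):
        z_new[i] = rng.choice(n_classes, p=prob[i])
    return z_new


def run_demo(seed=0, n_sweeps=200):
    """Two synthetic Gaussian classes, 10 labeled and 40 unlabeled points each."""
    rng = np.random.default_rng(seed)
    truth = np.repeat([0, 1], 50)
    y = rng.normal(size=(100, 2)) + 6.0 * truth[:, None]
    is_labeled = np.zeros(100, dtype=bool)
    is_labeled[:10] = True
    is_labeled[50:60] = True
    # Initialize missing labels by the nearest labeled-class mean.
    centers = np.stack(
        [y[is_labeled & (truth == k)].mean(axis=0) for k in range(2)]
    )
    z = np.argmin(((y[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
    z[is_labeled] = truth[is_labeled]
    votes = np.zeros((100, 2))
    for sweep in range(n_sweeps):
        z = gibbs_sweep(rng, y, z, is_labeled, 2, nu0=3, v0_inv=np.eye(2))
        if sweep >= n_sweeps // 2:  # discard the first half as burn-in
            votes[np.arange(100), z] += 1
    return votes.argmax(axis=1), truth, is_labeled
```

Calling `run_demo()` returns the majority-vote class for each observation over the retained sweeps, together with the true labels and the labeled-data mask, so the predictions on the unlabeled points can be compared with the truth. Drawing the precision matrix with the mean integrated out and then the mean given the precision is one valid blocking of the conditionally conjugate updates; other orderings of the Gibbs steps are equally possible.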