Semi-supervised learning is widely used to estimate classifiers from training data in which some of the feature vectors are unlabelled. With a generative model that specifies a form for the joint distribution of a feature vector and its ground-truth label, the Bayes' classifier can be estimated by maximum likelihood from partially classified training data. To increase the accuracy of this sample classifier, \cite{ahfock2020apparent} proposed adopting a missing-label mechanism and estimating the Bayes' classifier from the full likelihood formed within a framework that models the probability of a missing label, given its observed feature vector, in terms of the entropy of that feature vector. In the case of two Gaussian classes with a common covariance matrix, it was shown that the classifier so estimated from partially classified training data can even have a lower error rate than if it were estimated from a completely classified sample. Here, we focus on an algorithm for estimating the Bayes' classifier via the full likelihood in the case of multiple Gaussian classes with arbitrary covariance matrices. Different strategies for initializing the algorithm are discussed and illustrated. A new \proglang{R} package implementing these tools, \texttt{gmmsslm}, is demonstrated on real data.
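
As a hedged sketch of what the full likelihood looks like, the display below assumes the logistic missingness model of \cite{ahfock2020apparent}. The notation is introduced here for illustration only and is not fixed by the text above: $y_j$ is the $j$th feature vector ($j=1,\ldots,n$), $z_{ij}$ is the indicator that $y_j$ belongs to class $i$ ($i=1,\ldots,g$), $m_j=1$ if the label of $y_j$ is missing and $m_j=0$ otherwise, $\pi_i$ are the class proportions, $\phi(\cdot;\mu_i,\Sigma_i)$ is the Gaussian density of class $i$, $\theta$ collects the $\pi_i$, $\mu_i$ and $\Sigma_i$, and $\xi=(\xi_0,\xi_1)$ are the missingness parameters.
\begin{align*}
\tau_i(y_j;\theta) &= \frac{\pi_i\,\phi(y_j;\mu_i,\Sigma_i)}{\sum_{h=1}^{g}\pi_h\,\phi(y_j;\mu_h,\Sigma_h)}, \qquad
e_j = -\sum_{i=1}^{g}\tau_i(y_j;\theta)\log\tau_i(y_j;\theta),\\
q(y_j;\theta,\xi) &= \Pr(m_j=1\mid y_j) = \frac{\exp(\xi_0+\xi_1\log e_j)}{1+\exp(\xi_0+\xi_1\log e_j)},\\
\log L_{\mathrm{full}}(\theta,\xi) &= \sum_{j=1}^{n}\Bigl[(1-m_j)\sum_{i=1}^{g} z_{ij}\log\{\pi_i\,\phi(y_j;\mu_i,\Sigma_i)\} + m_j\log\sum_{i=1}^{g}\pi_i\,\phi(y_j;\mu_i,\Sigma_i)\Bigr]\\
&\quad + \sum_{j=1}^{n}\Bigl[(1-m_j)\log\{1-q(y_j;\theta,\xi)\} + m_j\log q(y_j;\theta,\xi)\Bigr].
\end{align*}
Under these assumptions, the first sum is the usual partially classified likelihood that ignores the missingness, and the second sum is the contribution of the missing-label mechanism. Because $q$ depends on $\theta$ through the entropy $e_j$, the pattern of missing labels itself carries information about the classifier.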