Classification of unlabeled data is usually achieved by supervised learning from labeled samples. Although many sophisticated supervised machine learning methods can predict the missing labels with high accuracy, they often lack the transparency required in situations where interpretable results and meaningful measures of confidence are important. Body fluid classification of forensic casework data is a case in point. We develop a new Biclustering Dirichlet Process (BDP), with a three-level hierarchy of clustering, and a model-based approach to classification that adapts to block structure in the data matrix. Because the class labels of some observations are missing, the number of rows in the data matrix for each class is unknown. The BDP handles this and extends existing biclustering methods by simultaneously biclustering multiple matrices, each with a random number of rows. We demonstrate our method by applying it to the motivating problem, the classification of body fluids based on mRNA profiles taken from crime scenes. Analyses of casework-like data show that our method is interpretable and produces well-calibrated posterior probabilities. More generally, our model can be applied to other types of data with a structure similar to that of the forensic data.