Classification of unlabeled data is usually achieved by supervised learning from labeled samples. Although many sophisticated supervised machine learning methods can predict missing labels with high accuracy, they often lack the transparency required in settings where interpretable results and meaningful measures of confidence are essential. Body fluid classification of forensic casework data is a case in point. We develop a new Biclustering Dirichlet Process for Class-assignment with Random Matrices (BDP-CaRMa), with a three-level hierarchy of clustering and a model-based approach to classification that adapts to block structure in the data matrix. Because the class labels of some observations are missing, the number of rows in the data matrix for each class is unknown. BDP-CaRMa handles this and extends existing biclustering methods by simultaneously biclustering multiple matrices, each with a random number of rows. We demonstrate our method by applying it to the motivating problem: the classification of body fluids based on mRNA profiles taken from crime scenes. Analyses of casework-like data show that our method is interpretable and produces well-calibrated posterior probabilities. More generally, our model can be applied to other types of data with a structure similar to that of the forensic data.
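As background on the Dirichlet process named in the method, the following is a minimal sketch of the Chinese restaurant process, a standard way to draw a random partition whose number of clusters is not fixed in advance. It is not the BDP-CaRMa model and omits its three-level hierarchy and biclustering; the function name `crp_partition`, the concentration parameter `alpha`, and the seed are illustrative assumptions rather than anything taken from the paper.

```python
import random


def crp_partition(n_items, alpha, seed=0):
    """Draw cluster labels for n_items from a Chinese restaurant process.

    Illustrative only: alpha (> 0) is an assumed concentration parameter.
    """
    rng = random.Random(seed)
    labels = []  # labels[i] is the cluster index of item i
    sizes = []   # sizes[k] is the number of items currently in cluster k
    for _ in range(n_items):
        # An existing cluster k is chosen with weight sizes[k];
        # a new cluster is opened with weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)   # open a new cluster
        else:
            sizes[k] += 1     # join an existing cluster
        labels.append(k)
    return labels, sizes


if __name__ == "__main__":
    labels, sizes = crp_partition(n_items=20, alpha=1.0)
    print("number of clusters:", len(sizes))
    print("cluster sizes:", sizes)
```

Larger values of `alpha` tend to produce more clusters, which is one way such a prior can leave the number of groups to be inferred from the data rather than fixed beforehand.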