Multilabel classification (MLC) is the task of predicting multiple binary labels simultaneously. It is challenging for two reasons: the relationship between the predictor variables and each label may be arbitrarily complex and may differ from label to label, and associations among labels may persist even after accounting for the effects of the predictors. In this paper, we present a Bayesian additive regression trees (BART) framework for this problem. BART is a flexible nonparametric model capable of uncovering complex relationships within the data. Our adaptation, MLCBART, assumes that labels arise from thresholding an underlying numeric scale, on which a multivariate normal model allows explicit estimation of the correlation structure among labels. This allows the model to discover complicated relationships of various forms and improves MLC predictive performance. Our Bayesian framework provides uncertainty quantification for each predicted label, and the MCMC draws yield an estimated conditional probability distribution over label combinations for any predictor values. Simulation experiments demonstrate the effectiveness of the proposed model by comparing it with a set of competing models, including an oracle model with the correct functional form. The results show that our model predicts label vectors more accurately than the other contenders and performs close to the oracle model. An example highlights how the method's measures of predictive uncertainty provide a nuanced understanding of classification results.
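The step from MCMC draws to a distribution over label combinations can be illustrated with a minimal sketch. It is not the MLCBART implementation: the function name `label_combination_probs`, the threshold at zero, and the synthetic "posterior draws" (which stand in for the output of a fitted sum-of-trees model) are all assumptions made for illustration. For each draw of the latent mean vector and correlation matrix at a fixed predictor value, the sketch simulates one latent multivariate normal vector, thresholds it, and tabulates the resulting label combinations.

```python
import numpy as np
from collections import Counter


def label_combination_probs(mean_draws, corr_draws, rng):
    """Estimate P(label combination | x) from posterior draws (illustrative).

    mean_draws: (S, q) latent means at one predictor value, one row per draw.
    corr_draws: (S, q, q) correlation matrices, one per draw.
    Assumes label j equals 1 when latent coordinate j exceeds 0.
    """
    mean_draws = np.asarray(mean_draws, dtype=float)
    # Cholesky factor of each correlation matrix, computed in one stacked call.
    chol = np.linalg.cholesky(np.asarray(corr_draws, dtype=float))
    eps = rng.standard_normal(mean_draws.shape)
    # One correlated latent vector per posterior draw: mu_s + L_s @ eps_s.
    latent = mean_draws + np.einsum("sij,sj->si", chol, eps)
    labels = (latent > 0).astype(int)
    counts = Counter(map(tuple, labels.tolist()))
    n = len(labels)
    return {combo: c / n for combo, c in counts.items()}


# Synthetic stand-in for posterior draws; a real analysis would use MCMC output.
rng = np.random.default_rng(0)
n_draws, rho = 4000, 0.8
mean_draws = rng.normal(0.0, 0.1, size=(n_draws, 2))
corr_draws = np.tile(np.array([[1.0, rho], [rho, 1.0]]), (n_draws, 1, 1))

probs = label_combination_probs(mean_draws, corr_draws, rng)
for combo, p in sorted(probs.items()):
    print(combo, round(p, 3))
```

With a strong positive latent correlation, the concordant combinations (0, 0) and (1, 1) receive most of the probability mass even though each marginal label probability is near one half, which is the kind of label dependence that independent per-label classifiers cannot represent.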