Semi-supervised learning (SSL) is a common approach to learning predictive models from both labeled and unlabeled examples. While SSL for the simple tasks of classification and regression has received a lot of attention from the research community, it has not been properly investigated for complex prediction tasks with structurally dependent variables. This is the case for multi-label classification and hierarchical multi-label classification, which may require additional information, possibly coming from the underlying distribution in the descriptive space provided by unlabeled examples, to better address the challenging task of simultaneously predicting multiple class labels. In this paper, we investigate this aspect and propose a (hierarchical) multi-label classification method based on semi-supervised learning of predictive clustering trees. We also extend the method towards ensemble learning and propose a method based on the random forest approach. An extensive experimental evaluation conducted on 23 datasets shows significant advantages of the proposed method and its extension with respect to their supervised counterparts. Moreover, the method preserves interpretability and reduces the time complexity of classical tree-based models.