This study introduces a novel hierarchical divisive clustering approach that uses stochastic splitting functions (SSFs) to improve classification performance on multi-class datasets through hierarchical classification (HC). The method can generate a class hierarchy without requiring explicit hierarchy information, which makes it suitable for datasets that lack prior knowledge of a hierarchy. By systematically dividing the classes into two subsets based on how well the classifier can discriminate between them, the approach constructs a binary tree representation of the class hierarchy. The approach is evaluated on 46 multi-class time series datasets using two popular classifiers (svm and rocket) and three SSFs (potr, srtr, and lsoo). The results show that the approach significantly improves classification performance on approximately half of the datasets when rocket is the classifier and on approximately a third when svm is the classifier. The study also explores the relationship between dataset features and HC performance. The number of classes and the flat classification (FC) score are consistently significant, although the results vary across splitting functions. Overall, the proposed approach is a promising strategy for improving classification by generating a hierarchical structure for multi-class time series datasets. Future research directions include exploring other splitting functions, classifiers, and hierarchy structures, as well as applying the approach to domains beyond time series data. The source code is openly available to support reproducibility and further exploration of the method.
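The divisive splitting idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not reproduce potr, srtr, or lsoo. It assumes that discriminability is approximated by a confusion matrix from a flat classifier, and it uses a hypothetical stochastic splitting function, `random_split`, that samples random bipartitions and keeps the one with the least confusion between the two subsets. All function names and the `trials` parameter are illustrative assumptions.

```python
import random


def cross_confusion(conf, left, right):
    """Confusion mass between two class subsets (lower means easier to separate)."""
    return sum(conf[a][b] + conf[b][a] for a in left for b in right)


def random_split(classes, conf, rng, trials=50):
    """Hypothetical stochastic splitting function (an assumption, not an SSF from the paper).

    Samples random bipartitions of `classes` and returns the one whose two
    subsets are least confused with each other according to `conf`.
    """
    best = None
    for _ in range(trials):
        shuffled = classes[:]
        rng.shuffle(shuffled)
        cut = rng.randint(1, len(shuffled) - 1)  # both subsets stay non-empty
        left, right = sorted(shuffled[:cut]), sorted(shuffled[cut:])
        score = cross_confusion(conf, left, right)
        if best is None or score < best[0]:
            best = (score, left, right)
    return best[1], best[2]


def build_tree(classes, conf, rng):
    """Recursively split the classes until each leaf holds a single class."""
    if len(classes) == 1:
        return classes[0]
    left, right = random_split(classes, conf, rng)
    return (build_tree(left, conf, rng), build_tree(right, conf, rng))


def leaves(tree):
    """Collect the class labels at the leaves of a nested-tuple binary tree."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


# Toy confusion matrix: classes 0 and 1 are confused with each other, as are 2 and 3.
conf = [
    [10, 4, 0, 0],
    [5, 9, 0, 0],
    [0, 0, 8, 6],
    [0, 0, 3, 11],
]
tree = build_tree([0, 1, 2, 3], conf, random.Random(0))
print(tree)
```

In a hierarchical classifier built on such a tree, each internal node would train a binary classifier that routes a sample to its left or right subset. That step is omitted here to keep the sketch short.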