Hyperbolic space is becoming a popular choice for representing data because many real-world datasets have a hierarchical structure, whether implicit or explicit. With it comes a need for algorithms that can solve fundamental tasks, such as classification, in hyperbolic space. Recently, multiple papers have investigated hyperbolic alternatives to hyperplane-based classifiers, such as logistic regression and SVMs. While effective, these approaches struggle with more complex hierarchical data. We therefore propose to generalize the well-known random forest to hyperbolic space. We do this by redefining the notion of a split using horospheres. Since finding the globally optimal split is computationally intractable, we find candidate horospheres through a large-margin classifier. To make hyperbolic random forests work on multi-class data and in imbalanced settings, we furthermore outline a new method for combining classes based on their lowest common ancestor, together with a class-balanced version of the large-margin loss. Experiments on standard and new benchmarks show that our approach outperforms both conventional random forest algorithms and recent hyperbolic classifiers.
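To make the notion of a horosphere-based split concrete, the following is a minimal sketch and not the authors' implementation. It assumes the Poincaré ball model and uses the standard Busemann function B_w(x) = log(||w - x||^2 / (1 - ||x||^2)) for an ideal point w on the unit sphere; the level sets of B_w are horospheres, so thresholding it partitions the data much as a threshold on a single feature does in a Euclidean decision tree. The function names `busemann` and `horosphere_split` and the offset `b` are illustrative assumptions, and the sketch does not cover how candidate horospheres are selected by the large-margin classifier.

```python
import numpy as np

def busemann(X, w):
    """Busemann function on the Poincaré ball for an ideal point w with ||w|| = 1.

    X has shape (n, d) with every row strictly inside the unit ball.
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    num = np.sum((w - X) ** 2, axis=-1)   # squared Euclidean distance to w
    den = 1.0 - np.sum(X ** 2, axis=-1)   # positive for points inside the ball
    return np.log(num / den)

def horosphere_split(X, w, b):
    """Boolean mask that is True for points in the horoball {x : B_w(x) <= b}."""
    return busemann(X, w) <= b

# Three points on the first axis, split by the horosphere through the origin
# that is centered at the ideal point w = (1, 0).
X = np.array([[0.5, 0.0], [0.0, 0.0], [-0.5, 0.0]])
w = np.array([1.0, 0.0])
mask = horosphere_split(X, w, b=0.0)
left, right = X[mask], X[~mask]
```

Here the point closest to `w` and the origin fall inside the horoball, while the point on the opposite side falls outside. In a tree, each node would hold its own pair (`w`, `b`) and recurse on `left` and `right`.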