Hyperbolic space is becoming a popular choice for representing data because many real-world datasets have a hierarchical structure, whether implicit or explicit. With it comes a need for algorithms that solve fundamental tasks, such as classification, directly in hyperbolic space. Recently, multiple papers have investigated hyperbolic alternatives to hyperplane-based classifiers, such as logistic regression and SVMs. While effective, these approaches struggle with more complex hierarchical data. We therefore propose to generalize the well-known random forest to hyperbolic space by redefining the notion of a split using horospheres. Since finding the globally optimal split is computationally intractable, we find candidate horospheres through a large-margin classifier. To make hyperbolic random forests work on multi-class and imbalanced data, we furthermore outline a new method for combining classes based on their lowest common ancestor, as well as a class-balanced version of the large-margin loss. Experiments on standard and new benchmarks show that our approach outperforms both conventional random forest algorithms and recent hyperbolic classifiers.
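To make the idea of a horosphere split concrete, the following is a minimal sketch and not the authors' implementation. It assumes the Poincaré ball model, where a horosphere is a level set of the Busemann function B_p(x) = log(||p - x||^2 / (1 - ||x||^2)) for an ideal point p on the unit sphere. The function names `busemann` and `horosphere_split`, the ideal point `p`, and the offset `b` are illustrative assumptions; in the approach described above the candidate horospheres come from a large-margin classifier, which is not shown here.

```python
import numpy as np


def busemann(x, p):
    """Busemann function on the Poincare ball for an ideal point p (||p|| = 1).

    x: array of shape (n, d) with ||x_i|| < 1; p: array of shape (d,).
    """
    x = np.atleast_2d(x)
    num = np.sum((p - x) ** 2, axis=1)    # squared Euclidean distance to p
    den = 1.0 - np.sum(x ** 2, axis=1)    # positive inside the unit ball
    return np.log(num / den)


def horosphere_split(x, p, b):
    """Boolean mask: True for points on or inside the horosphere B_p(x) = b."""
    return busemann(x, p) <= b


# Hypothetical toy data: three points on the horizontal axis of the 2D ball.
p = np.array([1.0, 0.0])                  # ideal point (assumed, not learned)
points = np.array([[0.0, 0.0], [0.5, 0.0], [-0.5, 0.0]])

values = busemann(points, p)              # 0 at the origin, negative toward p
mask = horosphere_split(points, p, b=0.0)
```

With `b = 0.0`, the horosphere passes through the origin, so the origin and the point closer to `p` fall on one side of the split and the point on the opposite side of the ball falls on the other. A tree node would route samples by this mask, in the same way a Euclidean decision tree routes samples by a threshold on a feature.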