Classification using sparse oblique random forests provides guarantees on uncertainty and confidence while controlling for specific error types. However, these forests use more data and more compute than other tree ensembles because they build deep trees and must sort or histogram linear combinations of the data at runtime. We provide a method that dynamically switches between histogramming and sorting to find the best split. We further optimize histogram construction using vector intrinsics. Evaluated on large datasets, our optimizations speed up training by 1.7-2.5x compared to existing oblique forests and by 1.5-2x compared to standard random forests. We also provide a GPU and a hybrid CPU-GPU implementation.
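The dynamic switching idea can be illustrated with a minimal sketch. The code below is not the paper's implementation: the `HISTOGRAM_THRESHOLD` cutoff, the Gini criterion, and the fixed bin count are illustrative assumptions. Given the projected values of a sparse oblique split (a linear combination of features) and class labels, it finds the best threshold either exactly (sort and scan every candidate) or approximately (bin into a histogram and scan bin edges), choosing the strategy by node size:

```python
import numpy as np

HISTOGRAM_THRESHOLD = 10_000  # hypothetical cutoff; a real system would tune this


def gini(counts):
    """Gini impurity from per-class counts."""
    n = counts.sum()
    if n == 0:
        return 0.0
    p = counts / n
    return 1.0 - np.sum(p * p)


def best_split_sorted(proj, y, n_classes):
    """Exact search: sort projected values, scan every candidate threshold."""
    order = np.argsort(proj)
    proj_s, y_s = proj[order], y[order]
    left = np.zeros(n_classes)
    right = np.bincount(y_s, minlength=n_classes).astype(float)
    n = len(y_s)
    best_score, best_thr = np.inf, None
    for i in range(n - 1):
        left[y_s[i]] += 1
        right[y_s[i]] -= 1
        if proj_s[i] == proj_s[i + 1]:
            continue  # cannot split between equal projected values
        score = (i + 1) / n * gini(left) + (n - i - 1) / n * gini(right)
        if score < best_score:
            best_score, best_thr = score, 0.5 * (proj_s[i] + proj_s[i + 1])
    return best_thr, best_score


def best_split_hist(proj, y, n_classes, n_bins=64):
    """Approximate search: bin projected values, scan only bin edges."""
    lo, hi = proj.min(), proj.max()
    if lo == hi:
        return None, np.inf
    bins = np.minimum(((proj - lo) / (hi - lo) * n_bins).astype(int), n_bins - 1)
    hist = np.zeros((n_bins, n_classes))
    np.add.at(hist, (bins, y), 1)  # the accumulation loop is what vector intrinsics accelerate
    left = np.zeros(n_classes)
    right = hist.sum(axis=0)
    n = len(y)
    best_score, best_thr = np.inf, None
    for b in range(n_bins - 1):
        left += hist[b]
        right -= hist[b]
        if left.sum() == 0 or right.sum() == 0:
            continue
        score = left.sum() / n * gini(left) + right.sum() / n * gini(right)
        if score < best_score:
            best_score, best_thr = score, lo + (b + 1) / n_bins * (hi - lo)
    return best_thr, best_score


def best_split(proj, y, n_classes):
    """Dynamically pick the cheaper strategy for this node's size."""
    if len(proj) < HISTOGRAM_THRESHOLD:
        return best_split_sorted(proj, y, n_classes)
    return best_split_hist(proj, y, n_classes)
```

Sorting costs O(n log n) per candidate projection but is exact, while histogramming is O(n) with a small constant and bounded scan cost, which is why the trade-off flips as nodes grow; the sketch switches on a single size threshold, though other signals (depth, class balance) could drive the decision.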