Tree ensemble methods such as Random Forests naturally induce supervised similarity measures through their decision tree structure, but existing implementations of proximities derived from tree ensembles typically suffer from quadratic time or memory complexity in the number of samples, limiting their scalability. In this work, we introduce a general framework for efficient proximity computation by defining a family of Separable Weighted Leaf-Collision Proximities. We show that any proximity measure in this family admits an exact sparse matrix factorization, restricting computation to leaf-level collisions and avoiding explicit pairwise comparisons. This formulation enables low-memory, scalable proximity computation using sparse linear algebra in Python. Empirical benchmarks demonstrate substantial runtime and memory improvements over traditional approaches, allowing tree ensemble proximities to scale efficiently to datasets with hundreds of thousands of samples on standard CPU hardware.
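As a rough illustration of the leaf-collision factorization idea, the following is a minimal sketch that computes the classical unweighted Random Forest proximity, the simplest member of this family, via a sparse product P = (1/T) M Mᵀ, where M is a one-hot sample-by-leaf membership matrix. It uses scikit-learn's RandomForestClassifier.apply and SciPy sparse matrices; it is an assumed illustration under those tools, not the implementation described in this work, and the separable weighted variants would replace the 0/1 entries of M with per-sample and per-leaf weight factors.

```python
# Sketch only: classical unweighted RF proximity via sparse factorization.
# Not the paper's implementation; names and parameters are illustrative.
import numpy as np
from scipy import sparse
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# leaf_ids[i, t] = index of the leaf that sample i falls into in tree t
leaf_ids = rf.apply(X)                      # shape (n_samples, n_trees)
n_samples, n_trees = leaf_ids.shape

# Offset leaf indices so every (tree, leaf) pair gets a unique column;
# collisions can then only occur between samples sharing a leaf in the same tree.
offsets = np.concatenate(([0], np.cumsum(leaf_ids.max(axis=0) + 1)[:-1]))
cols = (leaf_ids + offsets).ravel()
rows = np.repeat(np.arange(n_samples), n_trees)
data = np.ones(rows.size)

# Sparse membership matrix M: exactly one nonzero per (sample, tree).
M = sparse.csr_matrix((data, (rows, cols)),
                      shape=(n_samples, int(cols.max()) + 1))

# Exact proximity matrix from a sparse product; no explicit pairwise loop.
P = (M @ M.T) / n_trees
print(P.shape, P.nnz)
```

For very large datasets one would presumably retain only the sparse factor M (or its weighted analogues) and apply it to vectors on demand, rather than materializing the full proximity matrix P, which is what keeps the memory footprint low.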