Decision forests induce supervised similarities through the partition structure of their trees. Yet forest proximity computation is still often treated as a quadratic operation in the number of samples, which limits scalability and restricts broader use in kernel and representation-learning pipelines. We introduce a unified view of leaf-collision forest proximities through a class of Separable Weighted Leaf-Collision (SWLC) kernels, showing that most existing proximities differ only in their weighting scheme while sharing a common sparse leaf-incidence structure. This yields an explicit leaf-space representation that clarifies their kernel interpretation and leads to an exact finite-sample sparse factorization of the proximity matrix, avoiding an explicit all-pairs comparison and reducing computation to sparse linear algebra over leaf collisions. We implement this framework in a memory-efficient Python library and show, both theoretically and empirically, that exact kernel computation scales near-linearly in time and memory under standard forest regimes. Benchmarks verify the predicted scaling behavior in practice across datasets, proximity definitions, and forest settings, and show that the resulting sparse leaf-space representation can also be used directly for fast task-aware embedding.
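As an illustration of the sparse factorization idea, the following is a minimal sketch and not the library described above. It assumes the classic unweighted proximity (the fraction of trees in which two samples share a leaf) and a leaf-index matrix `leaves` of shape `(n_samples, n_trees)`, the layout returned by, for example, scikit-learn's `forest.apply(X)`. The function names `leaf_incidence` and `original_proximity` are hypothetical. Other weighting schemes would presumably replace the unit entries with per-sample weights on each side of the product.

```python
import numpy as np
from scipy import sparse


def leaf_incidence(leaves):
    """Sparse (n_samples x total_leaf_columns) incidence matrix Z.

    leaves[i, t] is the id of the leaf that sample i reaches in tree t.
    Z has exactly one nonzero per (sample, tree) pair.
    """
    leaves = np.asarray(leaves)
    n_samples, n_trees = leaves.shape
    # Shift ids tree by tree so each (tree, leaf) pair gets its own column.
    # Ids need not be contiguous; unused ids just yield empty columns.
    width = leaves.max(axis=0) + 1
    offsets = np.concatenate(([0], np.cumsum(width)[:-1]))
    cols = (leaves + offsets).ravel()
    rows = np.repeat(np.arange(n_samples), n_trees)
    data = np.ones(n_samples * n_trees)
    return sparse.csr_matrix(
        (data, (rows, cols)), shape=(n_samples, int(width.sum()))
    )


def original_proximity(leaves):
    """Unweighted proximity as a sparse product: P = Z Z^T / n_trees."""
    Z = leaf_incidence(leaves)
    n_trees = np.asarray(leaves).shape[1]
    # Only pairs that collide in at least one leaf produce a stored entry.
    return (Z @ Z.T) / n_trees


def brute_force_proximity(leaves):
    """Quadratic all-pairs reference, for checking small inputs only."""
    leaves = np.asarray(leaves)
    return (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)


# Synthetic leaf assignments stand in for a fitted forest (assumption).
rng = np.random.default_rng(0)
demo_leaves = rng.integers(0, 8, size=(200, 25))
P = original_proximity(demo_leaves)
assert np.allclose(P.toarray(), brute_force_proximity(demo_leaves))
```

The incidence matrix stores one nonzero per sample and tree, so its memory footprint grows linearly with the number of samples. The cost of the product depends on how many samples collide in each leaf, which is where assumptions about the forest regime (for example, bounded leaf sizes) enter.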