We propose soft Hoeffding trees (SoHoT), a new differentiable and transparent model for possibly infinite and changing data streams. Stream mining algorithms such as Hoeffding trees grow with the incoming data stream, but they currently lack the adaptability of end-to-end deep learning systems. End-to-end learning is desirable when a feature representation learned by a neural network is used in a tree, or when the outputs of trees are processed further in a deep learning model or workflow. Unlike Hoeffding trees, soft trees can be integrated into such systems because they are differentiable, but they are neither transparent nor explainable. Our model combines the extensibility and transparency of Hoeffding trees with the differentiability of soft trees. We introduce a new gating function that regulates the balance between univariate and multivariate splits in the tree. We run experiments on 20 data streams, comparing SoHoT to standard Hoeffding trees, Hoeffding trees with limited complexity, and soft trees that apply a sparse activation function for sample routing. The results show that soft Hoeffding trees outperform Hoeffding trees in estimating class probabilities and, at the same time, remain transparent where soft trees are not, with relatively small losses in AUROC and cross-entropy. We also demonstrate how a hyperparameter trades off transparency against performance, yielding univariate splits at one end of the spectrum and multivariate splits at the other.
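For readers unfamiliar with how a standard Hoeffding tree decides when to grow, the following minimal sketch shows the textbook Hoeffding bound test. It illustrates the classical criterion only, not the SoHoT procedure. The function names and the example numbers are hypothetical, and the tie-breaking threshold used in practical implementations is omitted.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

    value_range (R) is the range of the split criterion, delta is the
    allowed error probability, and n is the number of samples seen at the leaf.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split once the best attribute beats the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # Hypothetical gains observed at a leaf; the bound tightens as n grows.
    for n in (100, 200, 1000):
        eps = hoeffding_bound(1.0, 1e-7, n)
        print(n, round(eps, 4), should_split(0.30, 0.05, 1.0, 1e-7, n))
```

Because the bound shrinks with the number of observed samples, a leaf that cannot yet justify a split may do so later, which is what lets the tree grow incrementally with the stream.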