This paper introduces the Trinary decision tree, an algorithm designed to improve the handling of missing data in decision tree regressors and classifiers. Unlike other approaches, the Trinary decision tree does not assume that missing values contain any information about the response. Theoretical calculations of estimator bias and numerical illustrations on real data sets are presented to compare its performance with established algorithms in two missing data scenarios: Missing Completely at Random (MCAR) and Informative Missingness (IM). Notably, the Trinary tree outperforms its peers in MCAR settings, especially when data is missing only out-of-sample, while lagging behind in IM settings. A hybrid model, the TrinaryMIA tree, which combines the Trinary tree with the Missing In Attributes (MIA) approach, shows robust performance across all types of missingness. Despite the potential drawback of slower training, the Trinary tree offers a promising and more accurate method of handling missing data in decision tree algorithms.
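To make the two missingness scenarios concrete, the following is a minimal sketch of how MCAR and informative missingness could be simulated on a feature matrix. It is an illustration only, not the paper's experimental setup: the function names (`mcar_mask`, `informative_mask`, `apply_mask`), the 20% missing rate, and the above-median rule used for the informative mechanism are all assumptions made for this example.

```python
import numpy as np


def mcar_mask(n_rows, n_cols, rate, rng):
    """MCAR: every entry is missing with the same probability,
    independently of the features and of the response."""
    return rng.random((n_rows, n_cols)) < rate


def informative_mask(y, n_cols, rate, rng):
    """Informative missingness (hypothetical mechanism): rows whose
    response is above the median are three times as likely to have
    missing entries, so the mask itself carries signal about y."""
    y = np.asarray(y, dtype=float)
    high = y > np.median(y)
    # Per-row missing probability: 1.5 * rate for high-response rows,
    # 0.5 * rate otherwise, so the overall rate stays close to `rate`.
    p = np.where(high, 1.5 * rate, 0.5 * rate)
    return rng.random((y.shape[0], n_cols)) < p[:, None]


def apply_mask(X, mask):
    """Return a copy of X with masked entries replaced by NaN."""
    X = np.array(X, dtype=float, copy=True)
    X[mask] = np.nan
    return X


rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = X[:, 0] + 0.1 * rng.normal(size=2000)

X_mcar = apply_mask(X, mcar_mask(*X.shape, rate=0.2, rng=rng))
X_im = apply_mask(X, informative_mask(y, X.shape[1], rate=0.2, rng=rng))
```

Applying a mask only to held-out rows, and leaving the training rows complete, would mimic the case where data is missing only out-of-sample.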