Random forests are a statistical learning method widely used in many areas of scientific research because of their ability to learn complex relationships between input and output variables and their capacity to handle high-dimensional data. However, current random forest approaches are not flexible enough to handle heterogeneous data such as curves, images and shapes. In this paper, we introduce Fréchet trees and Fréchet random forests, which can handle data whose input and output variables take values in general metric spaces. To this end, a new way of splitting the nodes of trees is introduced and the prediction procedures of trees and forests are generalized. The random forest out-of-bag error and variable importance score are then naturally adapted. A consistency theorem for the Fréchet regressogram predictor using data-driven partitions is given and applied to Fréchet purely uniformly random trees. The method is studied through several simulation scenarios on heterogeneous data combining longitudinal, image and scalar data. Finally, a real dataset on air quality is used to illustrate the proposed method in practice.
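To make the idea of generalizing tree predictions to metric spaces concrete, the following minimal sketch computes an empirical Fréchet mean and Fréchet variance using only a distance function. It rests on several assumptions that are not stated in the abstract: the Fréchet mean is restricted to the sample points themselves (a medoid), and a candidate split is scored by the decrease in Fréchet variance, by analogy with the variance criterion of standard regression trees. The function names `frechet_medoid`, `frechet_variance` and `split_gain` are hypothetical and serve only as illustration.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
Distance = Callable[[T, T], float]


def frechet_medoid(points: Sequence[T], dist: Distance) -> T:
    """Sample point minimizing the sum of squared distances to all points.

    Assumption: the search is restricted to the sample itself, which
    approximates the Fréchet mean when no closed form is available.
    """
    return min(points, key=lambda c: sum(dist(c, p) ** 2 for p in points))


def frechet_variance(points: Sequence[T], dist: Distance) -> float:
    """Mean squared distance of the points to their (medoid) Fréchet mean."""
    center = frechet_medoid(points, dist)
    return sum(dist(center, p) ** 2 for p in points) / len(points)


def split_gain(left: Sequence[T], right: Sequence[T], dist: Distance) -> float:
    """Decrease in Fréchet variance obtained by splitting a node in two.

    Assumption: this mirrors the CART variance-reduction criterion; the
    paper's actual splitting rule may differ.
    """
    parent = list(left) + list(right)
    n = len(parent)
    children = (
        len(left) * frechet_variance(left, dist)
        + len(right) * frechet_variance(right, dist)
    ) / n
    return frechet_variance(parent, dist) - children


if __name__ == "__main__":
    # Toy outputs on the real line with the absolute-value distance.
    outputs = [0.0, 1.0, 2.0, 10.0]
    d = lambda a, b: abs(a - b)
    print(frechet_medoid(outputs, d))
    print(split_gain(outputs[:3], outputs[3:], d))
```

Because only `dist` touches the data, the same functions apply unchanged to curves, images or shapes once a suitable distance between two such objects is supplied.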