Datasets containing both categorical and continuous variables arise in many areas, and with the rapid development of modern measurement technologies, the dimensions of these variables can be very high. Despite recent progress in modelling high-dimensional continuous data, few methods can handle a mixed set of variables. To fill this gap, this paper develops a novel approach for classifying high-dimensional observations with mixed variables. Our framework builds on a location model, in which the distributions of the continuous variables conditional on the categorical ones are assumed to be Gaussian. We overcome the challenge of having to split the data into exponentially many cells, one for each combination of the categorical variables, by kernel smoothing, and we provide new perspectives on its bandwidth choice to ensure an analogue of Bochner's Lemma, a criterion that differs from the usual bias-variance tradeoff. We show that the two sets of parameters in our model can be estimated separately and propose penalized likelihood methods for their estimation. Results on the estimation accuracy and the misclassification rates are established, and the competitive performance of the proposed classifier is illustrated by extensive simulation and real data studies.
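To make the idea of smoothing across cells concrete, the following is a minimal sketch and not the estimator proposed in the paper. It assumes a simple weight of the form `bandwidth ** d`, where `d` is the Hamming distance between an observation's categorical cell and the target cell; the kernel form, the function names, and the simplified classification rule (common precision matrix, no cell or class probabilities, no penalization) are all illustrative assumptions.

```python
import numpy as np


def hamming_weights(cells, target, bandwidth):
    """Weight each observation by how close its categorical cell is to `target`.

    Assumed kernel (illustrative, not taken from the paper):
    weight = bandwidth ** d, with d the Hamming distance between cells
    and 0 <= bandwidth <= 1.
    """
    d = np.sum(np.asarray(cells) != np.asarray(target), axis=1)
    return np.power(float(bandwidth), d)


def smoothed_cell_mean(X, cells, target, bandwidth):
    """Kernel-weighted mean of the continuous variables for one target cell.

    bandwidth = 0 uses only observations in the target cell (and fails with a
    zero denominator if that cell is empty); bandwidth = 1 pools all cells.
    """
    w = hamming_weights(cells, target, bandwidth)
    return (w[:, None] * np.asarray(X, dtype=float)).sum(axis=0) / w.sum()


def classify(x, cell, X_by_class, cells_by_class, bandwidth, precision):
    """Assign `x` to the class whose smoothed cell mean is closest.

    Simplification: a Gaussian discriminant with a common, known precision
    matrix, ignoring class and cell probabilities.
    """
    scores = []
    for X, cells in zip(X_by_class, cells_by_class):
        diff = np.asarray(x, dtype=float) - smoothed_cell_mean(X, cells, cell, bandwidth)
        scores.append(-0.5 * diff @ precision @ diff)
    return int(np.argmax(scores))
```

In this sketch the bandwidth interpolates between two extremes: at 0 each cell is estimated from its own observations only, which is infeasible when the number of cells grows exponentially, and at 1 the categorical variables are ignored altogether. Intermediate values borrow strength from neighbouring cells.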