Linear Discriminant Analysis with High-dimensional Mixed Variables

Datasets containing both categorical and continuous variables arise in many areas, and with the rapid development of modern measurement technologies, the dimensions of these variables can be very high. Despite recent progress in modelling high-dimensional continuous data, few methods can deal with a mixed set of variables. To fill this gap, this paper develops a novel approach for classifying high-dimensional observations with mixed variables. Our framework builds on a location model, in which the distributions of the continuous variables conditional on the categorical ones are assumed to be Gaussian. We use kernel smoothing to overcome the challenge of having to split the data into exponentially many cells, one for each combination of the categorical variables. We also provide new perspectives on the choice of bandwidth, which is selected to ensure an analogue of Bochner's Lemma rather than to balance the usual bias-variance tradeoff. We show that the two sets of parameters in our model can be estimated separately, and we propose penalized likelihood methods for their estimation. Results on the estimation accuracy and the misclassification rates are established, and the competitive performance of the proposed classifier is illustrated by extensive simulation and real data studies.
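To make the role of kernel smoothing in the location model concrete, the following is a minimal sketch and not the paper's estimator. It assumes binary categorical variables, a geometric weight `lam ** h` on the Hamming distance `h` between cells, a known common precision matrix `sigma_inv`, and equal class priors. The function names and the weight form are hypothetical, and the penalized likelihood step and the categorical part of the discriminant are omitted.

```python
import numpy as np

def hamming_weights(u, U, lam):
    """Weight lam**h for each row of U, where h is its Hamming distance to cell u.

    The geometric form is an assumption made for illustration; lam in (0, 1]
    plays the role of a bandwidth (smaller lam borrows less from other cells).
    """
    h = np.sum(U != u, axis=1)
    return lam ** h

def smoothed_cell_mean(u, U, X, lam):
    """Kernel-weighted average of the continuous observations X, centred on cell u."""
    w = hamming_weights(u, U, lam)
    return w @ X / w.sum()

def classify(u, x, data, sigma_inv, lam):
    """Pick the class whose smoothed cell mean is nearest to x in Mahalanobis distance.

    `data` maps each class label to a pair (U, X) of categorical and continuous
    training observations. Equal priors and a shared covariance are assumed.
    """
    scores = {}
    for label, (U, X) in data.items():
        diff = x - smoothed_cell_mean(u, U, X, lam)
        scores[label] = diff @ sigma_inv @ diff
    return min(scores, key=scores.get)

# Toy data: three observations per class, two binary categorical variables.
U = np.array([[0, 0], [0, 1], [1, 1]])
X = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]])
data = {0: (U, X), 1: (U, X + 10.0)}

u_new = np.array([0, 0])
x_new = np.array([1.0, 1.0])
print(smoothed_cell_mean(u_new, U, X, lam=0.5))
print(classify(u_new, x_new, data, sigma_inv=np.eye(2), lam=0.5))
```

Because every observation contributes to every cell with a weight that decays in the Hamming distance, no cell needs its own subsample, which is the point of smoothing when the number of cells grows exponentially in the number of categorical variables.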

