In this paper, we extend distance correlation to categorical data with general encodings, such as one-hot encoding for nominal variables and semicircle encoding for ordinal variables. Unlike existing methods, our approach exploits the spacing information between categories, which improves the performance of distance correlation. We give two estimators, the maximum likelihood estimate and a bias-corrected estimate, and derive their limiting distributions under the null and alternative hypotheses. Furthermore, we establish the sure screening property for high-dimensional categorical data under mild conditions. We conduct a simulation study to compare the performance of different encodings and illustrate their practical utility using the 2018 General Social Survey data.
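To make the role of the encoding concrete, the following is a minimal sketch of how an encoded categorical variable can be fed into the standard sample distance correlation (the double-centering V-statistic). It is not the maximum likelihood or bias-corrected estimator of this paper. The function names are hypothetical, and the semicircle encoding shown here, which places the K ordered categories at equally spaced angles on the upper half of the unit circle, is an assumed form that may differ from the definition used in the paper.

```python
import math

def one_hot(labels, num_categories):
    """Encode category index k in {0, ..., K-1} as the k-th standard basis vector."""
    return [[1.0 if j == k else 0.0 for j in range(num_categories)] for k in labels]

def semicircle(labels, num_categories):
    """Assumed semicircle encoding: category k maps to the point at angle
    pi * k / (K - 1) on the unit circle, so adjacent categories are closer
    to each other than distant ones."""
    if num_categories < 2:
        raise ValueError("need at least two categories")
    step = math.pi / (num_categories - 1)
    return [[math.cos(step * k), math.sin(step * k)] for k in labels]

def _centered_distances(points):
    """Pairwise Euclidean distance matrix, double-centered by row, column, and grand means."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    row_mean = [sum(r) / n for r in d]  # equals the column means because d is symmetric
    grand_mean = sum(row_mean) / n
    return [[d[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
            for i in range(n)]

def distance_correlation(x, y):
    """Standard sample distance correlation of two equally long lists of vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of observations")
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n**2
    dvar_x = sum(v * v for r in a for v in r) / n**2
    dvar_y = sum(v * v for r in b for v in r) / n**2
    denom = math.sqrt(dvar_x * dvar_y)
    if denom == 0.0:
        return 0.0  # convention when either variable is constant
    return math.sqrt(max(dcov2, 0.0) / denom)

# Illustrative toy data (made up for this sketch): an ordinal variable with
# three levels and a nominal variable with two levels.
ordinal = [0, 0, 1, 1, 2, 2, 0, 2]
nominal = [0, 0, 0, 1, 1, 1, 0, 1]

dcor_onehot = distance_correlation(one_hot(ordinal, 3), one_hot(nominal, 2))
dcor_semicircle = distance_correlation(semicircle(ordinal, 3), one_hot(nominal, 2))
print(dcor_onehot, dcor_semicircle)
```

Under one-hot encoding every pair of distinct categories is the same distance apart, whereas under the assumed semicircle encoding the first and last categories are farther apart than neighboring ones. That difference in pairwise distances is the spacing information the abstract refers to.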