Categorical variables are of utmost importance in biomedical research. When two of them are considered, one often wants to test whether or not they are statistically dependent. We show weaknesses of classical methods (such as Pearson's chi-squared test and the G-test) and we propose testing strategies based on distances that lack those drawbacks. We first develop this theory for classical two-dimensional contingency tables, within the context of distance covariance, an association measure that characterises general statistical independence of two variables. We then apply the same fundamental ideas to one-dimensional tables, namely to testing for goodness of fit to a discrete distribution, for which we resort to an analogous statistic called energy distance. We prove that our methodology has desirable theoretical properties, and we show how to calibrate the null distribution of our test statistics without resorting to any resampling technique. We illustrate all this in simulations, as well as with some real data examples, demonstrating the adequate performance of our approach for biostatistical practice.
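To make the statistics named in the abstract concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes the discrete metric d(a, b) = 1 if a ≠ b and 0 otherwise on the categories, which is one natural choice for unordered labels; the abstract does not state which distance is used. Under that assumption the plug-in squared distance covariance of an r × c table reduces to the sum of squared differences between the joint cell proportions and the products of the marginal proportions, and the plug-in energy distance to a reference distribution reduces to the sum of squared differences between observed and reference probabilities. All function names are hypothetical, and no null calibration is attempted here.

```python
import numpy as np

def pearson_and_g(table):
    """Classical Pearson chi-squared and G statistics for a table of counts."""
    obs = np.asarray(table, dtype=float)
    n = obs.sum()
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n
    chi2 = ((obs - expected) ** 2 / expected).sum()
    mask = obs > 0  # cells with zero count contribute 0 to G
    g = 2.0 * (obs[mask] * np.log(obs[mask] / expected[mask])).sum()
    return chi2, g

def dcov_sq_general(table, dx, dy):
    """Plug-in squared distance covariance for arbitrary category distance
    matrices dx (r x r) and dy (c x c):
    sum over i, k, j, l of dx[i, k] * dy[j, l] * theta[i, j] * theta[k, l],
    where theta = joint proportions minus product of marginal proportions."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    theta = p - np.outer(p.sum(axis=1), p.sum(axis=0))
    return np.einsum("ik,jl,ij,kl->", dx, dy, theta, theta)

def dcov_sq_discrete(table):
    """Closed form of the above under the (assumed) discrete metric."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    theta = p - np.outer(p.sum(axis=1), p.sum(axis=0))
    return (theta ** 2).sum()

def energy_distance_discrete(counts, q):
    """Plug-in energy distance between observed proportions and a reference
    distribution q, again under the (assumed) discrete metric."""
    p_hat = np.asarray(counts, dtype=float)
    p_hat = p_hat / p_hat.sum()
    return ((p_hat - np.asarray(q, dtype=float)) ** 2).sum()

if __name__ == "__main__":
    table = np.array([[10, 20], [20, 10]])  # illustrative counts, not real data
    print(pearson_and_g(table))
    print(dcov_sq_discrete(table))
    print(energy_distance_discrete([10, 0], [0.5, 0.5]))
```

The general function accepts any pair of distance matrices, so a different metric (for example one reflecting an ordering of the categories) can be swapped in without changing the rest of the sketch.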