Contingency tables are a fundamental representation of multivariate categorical data. Because the size of a contingency table grows exponentially with the number of variables, even a moderate number of variables, each with a moderate number of levels, produces a huge number of cells, most of which remain empty even with a substantial amount of data. We propose an efficient method for inferring higher-order loglinear models in such scenarios. We tackle the computational challenge by using only a sample of the empty cells and deriving the associated likelihood under a Poisson sampling scheme. This allows us to define an iteratively re-weighted least squares (IRWLS) algorithm for parameter estimation. In the extreme setting of huge contingency tables, we show how standard Poisson regression on the sampled data converges to this IRWLS scheme when the number of sampled empty cells exceeds the number of observations. We illustrate the method with an analysis of data from the General Social Survey, which consists of 15014 observations in a 70-dimensional contingency table with a total of 2.6 × 10^{39} cells.
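To make the idea of fitting a loglinear model from the nonempty cells plus a sample of empty cells concrete, the following is a minimal sketch on a toy table. It is not the likelihood derived in the paper: as a stand-in, it assumes a simple inverse-probability weighting in which each sampled empty cell represents an equal share of all empty cells, and it fits only a main-effects model. All names (`weighted_poisson_irwls`, `solve`, `probs`, the sizes `d`, `n`, `m`) are hypothetical and chosen for illustration.

```python
import math
import random

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    M = [A[i][:] + [b[i]] for i in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * k
    for r in range(k - 1, -1, -1):
        x[r] = (M[r][k] - sum(M[r][j] * x[j] for j in range(r + 1, k))) / M[r][r]
    return x

def weighted_poisson_irwls(X, y, w, iters=100, tol=1e-10):
    """Fit log(mu) = X beta by IRWLS, maximizing sum_i w_i * (y_i*log(mu_i) - mu_i)."""
    p = len(X[0])
    beta = [0.0] * p
    # Column 0 is the intercept: start at the log of the weighted mean count.
    beta[0] = math.log(sum(wi * yi for wi, yi in zip(w, y)) / sum(w))
    for _ in range(iters):
        XtWX = [[0.0] * p for _ in range(p)]
        XtWz = [0.0] * p
        for row, yi, wi in zip(X, y, w):
            eta = sum(b * x for b, x in zip(beta, row))
            mu = math.exp(eta)
            z = eta + (yi - mu) / mu   # working response
            ww = wi * mu               # working weight
            for a in range(p):
                XtWz[a] += ww * row[a] * z
                for c in range(p):
                    XtWX[a][c] += ww * row[a] * row[c]
        new_beta = solve(XtWX, XtWz)
        step = max(abs(u - v) for u, v in zip(new_beta, beta))
        beta = new_beta
        if step < tol:
            break
    return beta

random.seed(0)
d, n, m = 10, 200, 300   # binary variables, observations, sampled empty cells
probs = [0.2 + 0.03 * j for j in range(d)]   # assumed marginal probabilities

# Observed counts, stored sparsely: cell index (a d-bit integer) -> count.
counts = {}
for _ in range(n):
    cell = sum(1 << j for j in range(d) if random.random() < probs[j])
    counts[cell] = counts.get(cell, 0) + 1

# Rejection-sample m distinct empty cells without enumerating the table.
empty = set()
while len(empty) < m:
    cell = random.getrandbits(d)
    if cell not in counts:
        empty.add(cell)

n_empty_total = 2 ** d - len(counts)
cells = list(counts) + sorted(empty)
y = [counts.get(c, 0) for c in cells]
# Nonempty cells get weight 1; each sampled empty cell stands in for
# n_empty_total / m empty cells (an assumed weighting, not the paper's).
w = [1.0 if c in counts else n_empty_total / m for c in cells]
# Main-effects design: intercept plus one indicator per variable.
X = [[1.0] + [float((c >> j) & 1) for j in range(d)] for c in cells]

beta = weighted_poisson_irwls(X, y, w)
```

Here `beta[0]` is the intercept and `beta[1:]` are the main effects. The table has only 2^10 cells so the sketch runs quickly, but the fit never enumerates the full table: it touches only the nonempty cells and the `m` sampled empty ones, which is the property that matters when the table is far too large to store.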