Statistical Inference for Covariate-Adjusted and Interpretable Generalized Factor Model with Application to Testing Fairness

In the era of data explosion, statisticians have been developing interpretable and computationally efficient statistical methods to measure latent factors (e.g., skills, abilities, and personalities) using large-scale assessment data. Beyond understanding the latent information, the effect of covariates on responses, controlling for latent factors, is also of great scientific interest and has wide applications. One example is evaluating the fairness of educational testing, where the covariate effect reflects whether a test question is biased with respect to individual characteristics (e.g., gender and race) after accounting for test takers' latent abilities. However, the large sample size, substantial covariate dimension, and long tests pose challenges to developing efficient methods and drawing valid inferences. Moreover, to accommodate the commonly encountered discrete response types, nonlinear latent factor models are often assumed, which adds further complexity to the problem. To address these challenges, we consider a covariate-adjusted generalized factor model and develop novel and interpretable conditions that resolve the identifiability issue. Based on these identifiability conditions, we propose a joint maximum likelihood estimation method and establish estimation consistency and asymptotic normality results for the covariate effects under a practical yet challenging asymptotic regime. Furthermore, we derive estimation and inference results for the latent factors and the factor loadings. We illustrate the finite sample performance of the proposed method through extensive numerical studies and an application to an educational assessment dataset obtained from the Programme for International Student Assessment (PISA).
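To make the setting concrete, the following is a minimal sketch of what a covariate-adjusted factor model for binary responses could look like. The abstract does not state the model equations, so everything here is an assumption made for illustration: a logistic link, an additive linear predictor (item intercept plus loadings times latent factors plus covariate effects times covariates), and the hypothetical function names `simulate` and `joint_log_likelihood`. It is not the authors' estimator; it only simulates data and evaluates the joint log-likelihood that a joint maximum likelihood method would maximize over both item parameters and latent factors.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def linear_predictor(i, j, factors, covariates, intercepts, loadings, cov_effects):
    """Assumed additive form: intercept_j + loadings_j . factors_i + cov_effects_j . covariates_i."""
    latent = sum(g * u for g, u in zip(loadings[j], factors[i]))
    adjust = sum(b * x for b, x in zip(cov_effects[j], covariates[i]))
    return intercepts[j] + latent + adjust


def simulate(n, q, k, p, seed=0):
    """Simulate binary responses of n individuals to q items (assumed logistic model).

    k is the number of latent factors and p the number of binary covariates.
    """
    rng = random.Random(seed)
    factors = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(n)]
    covariates = [[float(rng.random() < 0.5) for _ in range(p)] for _ in range(n)]
    intercepts = [rng.uniform(-1, 1) for _ in range(q)]
    loadings = [[rng.uniform(0.5, 1.5) for _ in range(k)] for _ in range(q)]
    # Illustrative choice: only item 0 depends on covariate 0 once the latent
    # factors are controlled for, i.e. only item 0 is "biased".
    cov_effects = [[0.0] * p for _ in range(q)]
    cov_effects[0][0] = 1.0
    params = (factors, covariates, intercepts, loadings, cov_effects)
    responses = [
        [int(rng.random() < sigmoid(linear_predictor(i, j, *params))) for j in range(q)]
        for i in range(n)
    ]
    return responses, params


def joint_log_likelihood(responses, params):
    """Bernoulli log-likelihood, treating latent factors as fixed parameters."""
    total = 0.0
    for i, row in enumerate(responses):
        for j, y in enumerate(row):
            z = linear_predictor(i, j, *params)
            # log P(y | z) = y * z - log(1 + exp(z)), computed stably.
            softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
            total += y * z - softplus
    return total


if __name__ == "__main__":
    responses, params = simulate(n=200, q=10, k=2, p=1, seed=1)
    print(joint_log_likelihood(responses, params))
```

In this sketch, a nonzero entry of `cov_effects` for an item plays the role of the covariate effect described above: individuals with the same latent factors but different covariate values have different response probabilities on that item. Testing fairness then amounts to inference on whether such entries are zero.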
