Sparse Bayesian Multidimensional Item Response Theory

Multidimensional Item Response Theory (MIRT) is widely sought after by applied researchers looking for interpretable (sparse) explanations of response patterns in questionnaire data. There is, however, an unmet demand for such sparsity discovery tools in practice. Our paper develops a Bayesian platform for binary and ordinal item MIRT that requires minimal tuning and, owing to its parallelizable features, scales well to large datasets. Bayesian methodology for MIRT models has traditionally relied on MCMC simulation, which not only can be slow in practice but also often renders exact sparsity recovery impossible without additional thresholding. In this work, we develop a scalable Bayesian EM algorithm to estimate sparse factor loadings from mixed continuous, binary, and ordinal item responses. We address the seemingly insurmountable problem of unknown latent factor dimensionality with tools from Bayesian nonparametrics, which enable estimation of the number of factors. Rotations to sparsity through parameter expansion further enhance convergence and interpretability without identifiability constraints. In our simulation study, we show that our method reliably recovers both the factor dimensionality and the latent structure on high-dimensional synthetic data, even for small samples. We demonstrate the practical usefulness of our approach on three datasets: an educational assessment dataset, a quality-of-life measurement dataset, and a bio-behavioral dataset. All demonstrations show that our tool yields interpretable estimates, facilitating interesting discoveries that might otherwise go unnoticed in a purely confirmatory factor analysis setting.
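
To make the setting concrete, the following is a minimal sketch of the kind of data a sparse binary MIRT model describes. It is not the estimation algorithm from the paper. It only simulates binary responses from a probit MIRT model whose loading matrix is sparse. The function name `simulate_sparse_mirt`, the probit link, the simple-structure loading pattern (one nonzero loading per item), and all numeric ranges are assumptions chosen for illustration.

```python
import numpy as np


def simulate_sparse_mirt(n_persons=500, n_items=12, n_factors=3, seed=0):
    """Simulate binary responses from a probit MIRT model with sparse loadings.

    Hypothetical helper for illustration only; all settings are assumptions.
    """
    rng = np.random.default_rng(seed)

    # Sparse loading matrix: each item loads on exactly one factor.
    loadings = np.zeros((n_items, n_factors))
    for j in range(n_items):
        loadings[j, j % n_factors] = rng.uniform(0.8, 1.5)

    # Item intercepts (related to item easiness/difficulty).
    intercepts = rng.normal(0.0, 0.5, size=n_items)

    # Latent traits: one standard normal vector per person.
    theta = rng.standard_normal((n_persons, n_factors))

    # Probit link written with a latent Gaussian variable:
    # response is 1 when theta @ loadings.T + intercept + noise > 0.
    noise = rng.standard_normal((n_persons, n_items))
    latent = theta @ loadings.T + intercepts + noise
    responses = (latent > 0).astype(int)

    return responses, loadings, intercepts


if __name__ == "__main__":
    responses, loadings, intercepts = simulate_sparse_mirt()
    print("response matrix shape:", responses.shape)
    print("nonzero loadings per item:", (loadings != 0).sum(axis=1))
```

In this sketch the goal of sparse estimation would be to recover the zero pattern of `loadings` and the number of columns from `responses` alone. Neither is assumed known in the setting the abstract describes.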
