Item Response Theory (IRT) models aim to assess latent abilities of $n$ examinees along with latent difficulty characteristics of $m$ test items from categorical data indicating the quality of the corresponding answers. Classical psychometric assessments are based on a relatively small number of examinees and items, say a class of $200$ students solving an exam comprising $10$ problems. More recent global large-scale assessments such as PISA, as well as internet studies, may lead to significantly increased numbers of participants. Additionally, in the context of machine learning, where algorithms take the role of examinees and data analysis problems take the role of items, both $n$ and $m$ may become very large, challenging the efficiency and scalability of computations. To learn the latent variables in IRT models from large data, we leverage the similarity of these models to logistic regression, which can be approximated accurately using small weighted subsets called coresets. We develop coresets for use in alternating IRT training algorithms, facilitating scalable learning from large data.
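As a minimal sketch of the connection to logistic regression invoked above, assume the standard two-parameter logistic (2PL) IRT model (the abstract does not fix a particular IRT variant, and the symbols $\theta_i$, $a_j$, $b_j$ are the usual notation rather than taken from the text): the probability that examinee $i$ answers item $j$ correctly is
\[
\Pr(X_{ij}=1 \mid \theta_i, a_j, b_j) \;=\; \sigma\bigl(a_j(\theta_i - b_j)\bigr), \qquad \sigma(t) = \frac{1}{1+e^{-t}}.
\]
With the item parameters $(a_j, b_j)$ held fixed, the likelihood in the abilities $\theta_i$ has the form of a logistic regression, and symmetrically for the item parameters when the abilities are fixed; alternating between these two subproblems, each of which can be compressed by a coreset, is the training structure referred to in the abstract.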