Existing Computerized Adaptive Testing (CAT) frameworks are typically built on predicting the correctness of a student's response to a question. Although effective, this approach fails to leverage the textual information in questions and responses, especially for open-ended questions. In this work, we propose GENCAT (\textbf{GEN}erative \textbf{CAT}), a novel CAT framework that leverages Large Language Models for knowledge estimation and question selection. First, we develop a Generative Item Response Theory (GIRT) model that enables us to estimate student knowledge from their open-ended responses and to predict responses to unseen questions. We train the model in a two-step process, first via Supervised Fine-Tuning and then via preference optimization for knowledge-response alignment. Second, we introduce three question selection algorithms that leverage the generative capabilities of the GIRT model, based on the uncertainty, linguistic diversity, and informativeness of sampled student responses. Third, we conduct experiments on two real-world programming datasets and demonstrate that GENCAT outperforms existing CAT baselines, achieving an AUC improvement of up to 4.32\% in the key early testing stages.
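To make the uncertainty-based selection criterion concrete, the following is a minimal sketch, not the paper's implementation, assuming a hypothetical GIRT interface in which `sample_response(state, question)` generates a predicted student response and `grade(question, response)` returns a binary correctness label. The next question is the one whose sampled responses have the highest correctness entropy.

```python
import math
from collections import Counter

def select_question_by_uncertainty(girt_model, student_state,
                                   candidate_questions, num_samples=10):
    """Pick the candidate question whose sampled responses are most uncertain.

    Hypothetical interface (assumption, not the paper's API):
    `girt_model.sample_response(state, question)` returns a generated
    student response string, and `girt_model.grade(question, response)`
    returns a binary correctness label.
    """
    best_question, best_entropy = None, -1.0
    for question in candidate_questions:
        # Sample responses the model predicts this student would write,
        # then grade each one for correctness.
        labels = [
            girt_model.grade(
                question,
                girt_model.sample_response(student_state, question))
            for _ in range(num_samples)
        ]
        # Entropy of the empirical correct/incorrect distribution:
        # high entropy means the model is most unsure about this student
        # on this question.
        counts = Counter(labels)
        entropy = -sum((c / num_samples) * math.log(c / num_samples)
                       for c in counts.values())
        if entropy > best_entropy:
            best_question, best_entropy = question, entropy
    return best_question
```

The diversity- and informativeness-based variants mentioned in the abstract would replace the entropy score with a measure computed over the sampled response texts themselves; the selection loop stays the same.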