Generative AI is transforming the educational landscape, raising significant concerns about cheating. Despite the widespread use of multiple-choice questions (MCQs) in assessments, the detection of AI cheating on MCQ-based tests remains almost unexplored, in contrast to the attention given to detecting AI cheating on text-rich student outputs. In this paper, we propose a method based on Item Response Theory (IRT) to address this gap. Our approach rests on the assumption that artificial and human intelligence exhibit different response patterns, so that AI cheating manifests as deviations from the expected patterns of human responses. These deviations are modeled using person-fit statistics. We demonstrate that this method not only effectively highlights the differences between human responses and those generated by premium versions of leading chatbots (ChatGPT, Claude, and Gemini), but is also sensitive to the amount of AI cheating in the data. Furthermore, we show that the chatbots differ in their reasoning profiles. Our work provides both a theoretical foundation and empirical evidence for the application of IRT to identify AI cheating in MCQ-based assessments.
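The abstract does not specify which person-fit statistic the paper uses, but the general idea of flagging aberrant response patterns can be illustrated with the standardized log-likelihood statistic l_z under a one-parameter (Rasch) IRT model. The following is a minimal sketch, not the paper's implementation; the item difficulties and response vectors are invented for illustration:

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def lz_statistic(responses, theta, difficulties):
    """Standardized log-likelihood person-fit statistic l_z.

    responses    : list of 0/1 item scores for one test taker
    theta        : the test taker's (estimated) ability
    difficulties : Rasch difficulty b_i for each item
    Large negative values flag response patterns that are unlikely
    under the model, e.g. missing easy items while answering hard
    ones correctly.
    """
    l0, mean, var = 0.0, 0.0, 0.0
    for u, b in zip(responses, difficulties):
        p = rasch_p(theta, b)
        q = 1.0 - p
        l0 += u * math.log(p) + (1 - u) * math.log(q)
        mean += p * math.log(p) + q * math.log(q)
        var += p * q * math.log(p / q) ** 2
    return (l0 - mean) / math.sqrt(var)

# Two examinees of equal ability (theta = 0) on the same 5 items:
b = [-2.0, -1.0, 0.0, 1.0, 2.0]
typical  = [1, 1, 1, 0, 0]   # easy items right, hard items wrong
aberrant = [0, 0, 0, 1, 1]   # the reverse: a suspicious pattern
print(lz_statistic(typical, 0.0, b))   # ~0.91: consistent with the model
print(lz_statistic(aberrant, 0.0, b))  # ~-4.49: strong misfit
```

In a detection setting, strongly negative l_z values would mark respondents whose answer patterns deviate from what the human-calibrated model expects.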