Generative AI is transforming the educational landscape, raising significant concerns about cheating. Despite the widespread use of multiple-choice questions (MCQs) in assessments, the detection of AI cheating on MCQ-based tests remains almost unexplored, in contrast to the attention given to detecting AI cheating on text-rich student outputs. In this paper, we propose a method based on Item Response Theory (IRT) to address this gap. Our approach rests on the assumption that artificial and human intelligence exhibit different response patterns, so that AI cheating manifests as deviations from the expected patterns of human responses. These deviations are modeled using person-fit statistics. We demonstrate that this method not only effectively highlights the differences between human responses and those generated by premium versions of leading chatbots (ChatGPT, Claude, and Gemini), but is also sensitive to the amount of AI cheating in the data. Furthermore, we show that the chatbots differ in their reasoning profiles. Our work provides both a theoretical foundation and empirical evidence for the application of IRT to identify AI cheating in MCQ-based assessments.
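The abstract does not specify which person-fit statistic the paper uses, but the general idea of flagging aberrant response patterns can be illustrated with the standardized log-likelihood statistic l_z under a one-parameter (Rasch) IRT model. The following is a minimal sketch, not the paper's implementation; the item difficulties and response vectors are invented for illustration:

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def lz_statistic(responses, theta, difficulties):
    """Standardized log-likelihood person-fit statistic l_z.

    responses    : list of 0/1 item scores for one test taker
    theta        : the test taker's (estimated) ability
    difficulties : Rasch difficulty b_i for each item
    Large negative values flag response patterns that are unlikely
    under the model, e.g. missing easy items while answering hard
    ones correctly.
    """
    l0, mean, var = 0.0, 0.0, 0.0
    for u, b in zip(responses, difficulties):
        p = rasch_p(theta, b)
        q = 1.0 - p
        l0 += u * math.log(p) + (1 - u) * math.log(q)
        mean += p * math.log(p) + q * math.log(q)
        var += p * q * math.log(p / q) ** 2
    return (l0 - mean) / math.sqrt(var)

# Two examinees of equal ability (theta = 0) on the same 5 items:
b = [-2.0, -1.0, 0.0, 1.0, 2.0]
typical  = [1, 1, 1, 0, 0]   # easy items right, hard items wrong
aberrant = [0, 0, 0, 1, 1]   # the reverse: a suspicious pattern
print(lz_statistic(typical, 0.0, b))   # ~0.91: consistent with the model
print(lz_statistic(aberrant, 0.0, b))  # ~-4.49: strong misfit
```

In a detection setting, strongly negative l_z values would mark respondents whose answer patterns deviate from what the human-calibrated model expects.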