Reimagining Assessment in the Age of Generative AI: Lessons from Open-Book Exams with ChatGPT

Generative AI systems such as ChatGPT challenge traditional assumptions about academic assessment by enabling students to generate explanations, code, and solutions in real time. Rather than attempting to restrict AI use, this study investigates how students actually interact with such systems during formal evaluation. Engineering students were permitted to use ChatGPT during take-home open-book exams and were required to submit interaction transcripts alongside exam solutions. This provided direct observational evidence of reasoning processes rather than relying on self-reported behavior. Qualitative analysis revealed three progressive patterns of use: answer retrieval, guided collaboration, and critical verification. While some students initially copied questions verbatim and received generic responses, many refined prompts iteratively and tested outputs. Some of the strongest evidence of reasoning appeared when students evaluated incorrect or incomplete AI responses, revealing evaluative reasoning through debugging, comparison, and justification. The presence of generative AI shifted the cognitive task of assessment from producing solutions to assessing solution validity. The findings suggest that, in AI-mediated assessment environments, correctness of final answers alone may no longer provide sufficient evidence of comprehension. Instead, competencies such as prompt formulation, verification, and judgment become visible indicators of learning. Transparent integration of AI appeared to reduce focus on rule avoidance and promote self-regulation. Assessments should evolve to evaluate reasoning about solutions rather than independent solution production. Generative AI therefore does not invalidate assessment but has the potential to expose deeper forms of understanding aligned with professional practice.


