This study evaluates the performance of OpenAI's o1-preview model in higher-order cognitive domains, including critical thinking, systematic thinking, computational thinking, data literacy, creative thinking, logical reasoning, and scientific reasoning. Using established benchmarks, we compared the o1-preview model's performance to that of human participants across diverse educational levels. o1-preview achieved a mean score of 24.33 on the Ennis-Weir Critical Thinking Essay Test (EWCTET), surpassing undergraduate (13.8) and postgraduate (18.39) participants (z = 1.60 and 0.90, respectively). In systematic thinking, it scored 46.1 (SD = 4.12) on the Lake Urmia Vignette, significantly outperforming the human mean of 20.08 (SD = 8.13, z = 3.20). For data literacy, o1-preview scored 8.60 (SD = 0.70) on Merk et al.'s "Use Data" dimension, compared to the human post-test mean of 4.17 (SD = 2.02, z = 2.19). On creative thinking tasks, the model achieved an originality score of 2.98 (SD = 0.73), higher than the human mean of 1.74 (z = 0.71). In logical reasoning (LogiQA), it outperformed humans with an average accuracy of 90% (SD = 10%) versus 86% (SD = 6.5%, z = 0.62). For scientific reasoning, it achieved near-perfect performance (mean = 0.99, SD = 0.12) on the TOSLS, exceeding the highest human scores of 0.85 (SD = 0.13, z = 1.78). While o1-preview excelled in structured tasks, it showed limitations in problem-solving and adaptive reasoning. These results demonstrate the potential of AI to complement education in structured assessments but highlight the need for ethical oversight and further refinement before broader application.