This study evaluates the performance of OpenAI's o1-preview model in higher-order cognitive domains, including critical thinking, systematic thinking, computational thinking, data literacy, creative thinking, logical reasoning, and scientific reasoning. Using established benchmarks, we compared the o1-preview model's performance to that of human participants across diverse educational levels. o1-preview achieved a mean score of 24.33 on the Ennis-Weir Critical Thinking Essay Test (EWCTET), surpassing undergraduate (13.8) and postgraduate (18.39) participants (z = 1.60 and 0.90, respectively). In systematic thinking, it scored 46.1 (SD = 4.12) on the Lake Urmia Vignette, significantly outperforming the human mean of 20.08 (SD = 8.13, z = 3.20). For data literacy, o1-preview scored 8.60 (SD = 0.70) on Merk et al.'s "Use Data" dimension, compared to the human post-test mean of 4.17 (SD = 2.02, z = 2.19). On creative thinking tasks, the model achieved an originality score of 2.98 (SD = 0.73), higher than the human mean of 1.74 (z = 0.71). In logical reasoning (LogiQA), it outperformed humans with an average accuracy of 90% (SD = 10%) versus 86% (SD = 6.5%, z = 0.62). For scientific reasoning, it achieved near-perfect performance (mean = 0.99, SD = 0.12) on the TOSLS, exceeding the highest human scores of 0.85 (SD = 0.13, z = 1.78). While o1-preview excelled in structured tasks, it showed limitations in problem-solving and adaptive reasoning. These results demonstrate the potential of AI to complement education in structured assessments but highlight the need for ethical oversight and further refinement before broader application.