ChatGPT and Bard are AI chatbots based on large language models (LLMs) that are expected to find applications in diverse areas. In education, these AI technologies have been tested for applications in assessment and teaching. In assessment, AI has long been used in automated essay scoring and automated item generation. One psychometric property that these tools must have to assist or replace humans in assessment is high reliability, in terms of agreement between AI scores and human ratings. In this paper, we measure the reliability of the OpenAI ChatGPT and Google Bard LLM tools against experienced and trained human raters in perceiving and rating the complexity of writing prompts. Using the intraclass correlation (ICC) as the performance metric, we found that the inter-rater reliability of both OpenAI ChatGPT and Google Bard was low against the gold standard of human ratings.
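To make the agreement metric named above concrete, the following is a minimal sketch of one common ICC variant. The abstract does not state which ICC form was used, so the choice of ICC(2,1) (two-way random effects, absolute agreement, single rater) is an assumption, as are the function name `icc_2_1` and the example ratings, which are hypothetical and not data from the study.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one per rated item (e.g. a writing prompt),
    each row holding one score per rater. Assumed form; the paper may use another.
    """
    n = len(ratings)        # number of rated items
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: items (rows), raters (columns), residual.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical complexity ratings: rows are prompts, columns are (human, chatbot).
example = [[3, 4], [2, 2], [5, 3], [4, 4], [1, 3]]
print(round(icc_2_1(example), 3))
```

Values near 1 indicate that the chatbot's ratings closely match the human ratings in both rank order and absolute level, while values near 0 indicate little agreement. Because this variant measures absolute agreement, a systematic offset between raters lowers the coefficient even when their rankings coincide.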