Assessing communication and collaboration at scale depends on the labor-intensive task of coding communication data into categories according to different frameworks. Prior research has established that ChatGPT can be directly instructed with coding rubrics to code communication data, achieving accuracy comparable to human raters. However, whether coding from ChatGPT or similar AI technology performs consistently across demographic groups, such as gender and race, remains unclear. To address this gap, we introduce three checks for evaluating subgroup consistency in LLM-based coding by adapting an existing framework from the automated scoring literature. Using a typical collaborative problem-solving coding framework and data from three types of collaborative tasks, we examine ChatGPT-based coding performance across gender and racial/ethnic groups. Our results show that ChatGPT-based coding performs as consistently as human raters across gender and racial/ethnic groups, demonstrating its potential for use in large-scale assessments of collaboration and communication.