This study examines gender bias in artificial intelligence (AI), specifically within automatic scoring systems for student-written responses. The primary objective is to investigate whether AI scoring outcomes exhibit gender bias, disparity, or unfairness when models are trained on mixed-gender versus gender-specific samples. Using fine-tuned BERT and GPT-3.5 models, this research analyzes more than 1,000 human-graded student responses from male and female participants across six assessment items. The study employs three distinct techniques for bias analysis: scoring accuracy difference to evaluate bias, mean score gap (MSG) by gender to evaluate disparity, and Equalized Odds (EO) to evaluate fairness. The results indicate that the scoring accuracy of mixed-trained models does not differ significantly from that of male- or female-trained models, suggesting no significant scoring bias. For both BERT and GPT-3.5, mixed-trained models produced smaller MSGs than human raters, yielding non-disparate predictions. In contrast, gender-specifically trained models yielded larger MSGs than human raters, indicating that unbalanced training data may lead algorithmic models to amplify gender disparities. The EO analysis shows that mixed-trained models generated fairer outcomes than gender-specifically trained models. Collectively, the findings suggest that gender-unbalanced training data do not necessarily produce scoring bias but can enlarge gender disparities and reduce scoring fairness.
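To make the three metrics concrete, the sketch below illustrates how each could be computed for binary (0/1) scores. This is a minimal illustration, not the authors' implementation; the function names, data layout, and two-gender label encoding ("F"/"M") are assumptions introduced here for clarity.

```python
# Minimal sketch (assumed layout, not the authors' code): human scores,
# model scores, and a gender label per response, all as NumPy arrays.
import numpy as np

def accuracy_by_gender(human, model, gender):
    """Scoring accuracy difference (bias): gap in model-human agreement
    between female and male responses."""
    acc = {g: np.mean(model[gender == g] == human[gender == g])
           for g in ("F", "M")}
    return acc["F"] - acc["M"]

def mean_score_gap(scores, gender):
    """Mean score gap by gender (MSG, disparity): difference in average
    score between female and male responses. Applied to model scores
    and to human scores, the two MSGs can then be compared."""
    return scores[gender == "F"].mean() - scores[gender == "M"].mean()

def equalized_odds_gap(human, model, gender):
    """Equalized Odds (fairness): compare true-positive and false-positive
    rates across genders; returns the larger of the two gaps (0 = fair)."""
    def rates(g):
        h, m = human[gender == g], model[gender == g]
        tpr = np.mean(m[h == 1] == 1)  # agreement on correct responses
        fpr = np.mean(m[h == 0] == 1)  # over-crediting of incorrect ones
        return tpr, fpr
    (tpr_f, fpr_f), (tpr_m, fpr_m) = rates("F"), rates("M")
    return max(abs(tpr_f - tpr_m), abs(fpr_f - fpr_m))

# Toy usage with fabricated values, for illustration only:
human  = np.array([1, 0, 1, 1, 0, 1])
model  = np.array([1, 0, 0, 1, 0, 1])
gender = np.array(["F", "F", "F", "M", "M", "M"])
print(accuracy_by_gender(human, model, gender))   # -0.333... (bias gap)
print(mean_score_gap(model, gender))              # -0.333... (disparity)
print(equalized_odds_gap(human, model, gender))   #  0.5      (EO gap)
```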