We evaluate how well large language models (LLMs) understand African American Language (AAL) compared with White Mainstream English (WME), the encouraged "standard" form of English taught in American classrooms. We measure LLM performance using automatic metrics and human judgments on two tasks: a counterpart generation task, in which a model generates AAL given WME (or WME given AAL), and a masked span prediction (MSP) task, in which a model predicts a phrase that was removed from its input. Our contributions include: (1) an evaluation of six pre-trained large language models on the two language generation tasks; (2) a novel dataset of AAL text from multiple contexts (social media, hip-hop lyrics, focus groups, and linguistic interviews) with human-annotated counterparts in WME; and (3) documentation of model performance gaps that suggest bias, and identification of trends in the models' lack of understanding of AAL features.