We evaluate how well LLMs understand African American Language (AAL) in comparison to their performance on White Mainstream English (WME), the encouraged "standard" form of English taught in American classrooms. We measure LLM performance using automatic metrics and human judgments for two tasks: a counterpart generation task, where a model generates AAL (or WME) given WME (or AAL), and a masked span prediction (MSP) task, where models predict a phrase that was removed from their input. Our contributions include: (1) an evaluation of six pre-trained large language models on the two language generation tasks; (2) a novel dataset of AAL text from multiple contexts (social media, hip-hop lyrics, focus groups, and linguistic interviews) with human-annotated counterparts in WME; and (3) documentation of model performance gaps that suggest bias, together with identification of trends in the models' lack of understanding of AAL features.