While large pre-trained language models (LMs) find greater use across NLP, existing evaluation protocols do not consider how LM language use aligns with particular human demographic groups, which can be an important consideration in conversational AI applications. To address this gap, we consider how LM language skills can be measured and compared to those of human sub-populations. We propose using clinical techniques from speech-language pathology, a field with well-established norms for the acquisition of language skills, organized by (human) age. We conduct an evaluation with a domain expert (i.e., a clinically licensed speech-language pathologist) and also propose automated techniques to substitute for clinical evaluation at scale. We find that LM capability varies widely depending on the task, with GPT-3.5 mimicking the ability of a typical 6- to 9-year-old at tasks requiring inference about word meanings while simultaneously outperforming a typical 21-year-old at memorization. GPT-3.5 (InstructGPT) also has trouble with social language use, exhibiting fewer than 50% of the tested pragmatic skills. It shows errors in understanding particular parts of speech and associative word relations, among other lexical features. Ultimately, our findings reiterate the importance of considering demographic alignment and conversational goals when using these models as public-facing tools. Our framework will be publicly available via code, data, and a Python package.