Although large pre-trained language models (LMs) find greater use across NLP, existing evaluation protocols do not consider how LM language use aligns with particular human demographic groups, which can be an important consideration in conversational AI applications. To address this gap, we consider how LM language skills can be measured and compared to those of human sub-populations. We suggest clinical techniques from Speech Language Pathology, which has well-established norms for the acquisition of language skills, organized by (human) age. We conduct evaluation with a domain expert (i.e., a clinically licensed speech language pathologist), and also propose automated techniques to substitute for clinical evaluation at scale. We find that LM capability varies widely depending on the task, with GPT-3.5 mimicking the ability of a typical 6- to 9-year-old at tasks requiring inference about word meanings while simultaneously outperforming a typical 21-year-old at memorization. GPT-3.5 (InstructGPT) also has trouble with social language use, exhibiting less than 50% of the tested pragmatic skills. It makes errors in understanding particular parts of speech and associative word relations, among other lexical features. Ultimately, our findings reiterate the importance of considering demographic alignment and conversational goals when using these models as public-facing tools. Our framework will be publicly available via code, data, and a Python package.