Very large language models (LLMs) perform extremely well on a spectrum of NLP tasks in a zero-shot setting. However, little is known about their performance on human-level NLP problems that rely on understanding psychological concepts, such as assessing personality traits. In this work, we investigate the zero-shot ability of GPT-3 to estimate the Big 5 personality traits from users' social media posts. Through a set of systematic experiments, we find that, when knowledge about the trait is injected into the prompt, zero-shot GPT-3 performance on broad classification comes somewhat close to that of an existing pretrained SotA model. However, when prompted to provide fine-grained classification, its performance drops to close to that of a simple most-frequent-class (MFC) baseline. We further analyze where GPT-3 performs better, as well as worse, than a pretrained lexical model, illustrating systematic errors that suggest ways to improve LLMs on human-level NLP tasks.
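For readers unfamiliar with the most-frequent-class (MFC) baseline mentioned above, the following is a minimal sketch of how such a baseline is typically computed. The function names and the toy "high"/"low" labels are hypothetical illustrations and are not taken from the paper's data or code.

```python
from collections import Counter


def most_frequent_class(train_labels):
    """Return the label that occurs most often in the training labels."""
    return Counter(train_labels).most_common(1)[0][0]


def mfc_accuracy(train_labels, test_labels):
    """Accuracy of predicting the majority training label for every test item."""
    majority = most_frequent_class(train_labels)
    correct = sum(1 for label in test_labels if label == majority)
    return correct / len(test_labels)


# Hypothetical toy labels for a single trait (not from the paper).
train = ["high", "low", "high", "high", "low"]
test = ["high", "low", "low", "high"]

print(most_frequent_class(train))
print(mfc_accuracy(train, test))
```

A model whose accuracy is close to this value has learned little beyond the class distribution, which is why the abstract uses the MFC baseline as a reference point for fine-grained classification.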