Large language models (LLMs) have been shown to perform well at a variety of syntactic, discourse, and reasoning tasks. While LLMs are increasingly deployed in many forms, including conversational agents that interact with humans, we lack a grounded benchmark to measure how well they understand \textit{social} language. Here, we introduce a new theory-driven benchmark, SocKET, that contains 58 NLP tasks testing social knowledge, which we group into five categories: humor & sarcasm, offensiveness, sentiment & emotion, social factors, and trustworthiness. In tests on the benchmark, we demonstrate that current models attain only moderate performance but reveal significant potential for task transfer among different types and categories of tasks, as predicted from theory. Through zero-shot evaluations, we show that pretrained models already possess some innate but limited capabilities of social language understanding, and that training on one category of tasks can improve zero-shot performance on others. Our benchmark provides a systematic way to analyze model performance on an important dimension of language and points to clear room for improvement in building more socially aware LLMs. The associated resources are released at https://github.com/minjechoi/SOCKET.