We present a new challenge for examining whether large language models understand social norms. In contrast to existing datasets, solving ours requires a fundamental understanding of social norms. Our dataset features the largest set of social norm skills, consisting of 402 skills and 12,383 questions that cover a wide range of social norms, from opinions and arguments to culture and laws. We design the dataset according to the K-12 curriculum, which enables a direct comparison of the social understanding of large language models with that of humans, specifically elementary students. While prior work achieves nearly random accuracy on our benchmark, recent large language models such as GPT3.5-Turbo and LLaMA2-Chat improve performance significantly, reaching a level only slightly below human performance. We then propose a multi-agent framework based on large language models to improve their ability to understand social norms. This method further improves large language models, bringing them on par with humans. Given the increasing adoption of large language models in real-world applications, our finding is particularly important and presents a unique direction for future improvements.