We present a new challenge for examining whether large language models understand social norms. In contrast to existing datasets, ours cannot be solved without a fundamental understanding of social norms. It features the largest set of social norm skills, comprising 402 skills and 12,383 questions that cover a wide range of social norms, from opinions and arguments to culture and laws. We design the dataset according to the K-12 curriculum, which enables a direct comparison between the social understanding of large language models and that of humans, specifically elementary students. While prior work achieves near-random accuracy on our benchmark, recent large language models such as GPT3.5-Turbo and LLaMA2-Chat perform substantially better, falling only slightly below human performance. We then propose a multi-agent framework based on large language models to improve their ability to understand social norms. This method further improves large language models to be on par with humans. Given the increasing adoption of large language models in real-world applications, this finding is particularly important and points to a distinct direction for future improvement. The proposed method and dataset are available at https://huggingface.co/datasets/socialdataset2024/social.