As large language models (LLMs) develop anthropomorphic abilities, they are increasingly deployed as autonomous agents to interact with humans. However, evaluating their performance in realistic and complex social interactions remains a significant challenge. Most prior work builds datasets through simulated agent-to-agent interactions, which fail to capture the authentic linguistic styles and relational dynamics found in real human conversations. To address this gap, we introduce SI-Bench, a novel benchmark designed to evaluate social intelligence in LLMs. Grounded in broad social science theories, SI-Bench contains 2,221 authentic multi-turn dialogues collected from a social networking application. We further selected a subset of 312 dialogues for manual annotation across 8 major models. The experiments show that state-of-the-art (SOTA) models have surpassed human experts in process reasoning under complex social situations, yet they still fall behind humans in reply quality. Moreover, introducing Chain-of-Thought (CoT) reasoning may degrade the performance of LLMs in social dialogue tasks. All datasets are openly available at https://github.com/SI-Bench/SI-Bench.git.