As large language models (LLMs) develop anthropomorphic abilities, they are increasingly deployed as autonomous agents to interact with humans. However, evaluating their performance in realistic and complex social interactions remains a significant challenge. Most prior work builds datasets through simulated agent-to-agent interactions, which fail to capture the authentic linguistic styles and relational dynamics found in real human conversations. To address this gap, we introduce SI-Bench, a novel benchmark designed to evaluate social intelligence in LLMs. Grounded in broad social science theories, SI-Bench contains 2,221 authentic multi-turn dialogues collected from a social networking application. We further selected a subset of 312 dialogues for manual annotation across 8 major models. The experiments show that state-of-the-art (SOTA) models have surpassed human experts in process reasoning under complex social situations, yet they still fall behind humans in reply quality. Moreover, introducing Chain-of-Thought (CoT) reasoning may degrade the performance of LLMs in social dialogue tasks. All datasets are openly available at https://github.com/SI-Bench/SI-Bench.git.