We introduce EQ-Bench, a novel benchmark designed to evaluate aspects of emotional intelligence in Large Language Models (LLMs). We assess the ability of LLMs to understand complex emotions and social interactions by asking them to predict the intensity of emotional states of characters in a dialogue. The benchmark discriminates effectively between a wide range of models. We find that EQ-Bench correlates strongly (r=0.97) with comprehensive multi-domain benchmarks like MMLU (Hendrycks et al., 2020), indicating that it may be capturing similar aspects of broad intelligence. Our benchmark produces highly repeatable results using a set of 60 English-language questions. We also provide open-source code for an automated benchmarking pipeline at https://github.com/EQ-bench/EQ-Bench and a leaderboard at https://www.eqbench.com.
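For readers who want to reproduce a correlation of the kind reported above, the following is a minimal sketch of how such a coefficient could be computed from per-model scores. It assumes that r denotes the Pearson correlation coefficient, which the abstract does not state explicitly. The function name `pearson_r` and the score lists are hypothetical: the numbers are placeholders invented for illustration and are not results from this work.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder scores, one entry per model. These values are made up
    # for illustration only and do NOT come from EQ-Bench or MMLU.
    eq_bench_scores = [30.0, 45.0, 52.0, 61.0, 70.0]
    mmlu_scores = [40.0, 55.0, 60.0, 68.0, 80.0]
    print(f"r = {pearson_r(eq_bench_scores, mmlu_scores):.3f}")
```

Each list holds one score per model, in the same model order, so that the coefficient measures how closely the two benchmarks rank and space the same set of models.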