We introduce EQ-Bench, a novel benchmark designed to evaluate aspects of emotional intelligence in Large Language Models (LLMs). We assess the ability of LLMs to understand complex emotions and social interactions by asking them to predict the intensity of the emotional states of characters in a dialogue. The benchmark discriminates effectively between a wide range of models. We find that EQ-Bench correlates strongly (r=0.97) with comprehensive multi-domain benchmarks such as MMLU (Hendrycks et al., 2020), suggesting that it may capture similar aspects of broad intelligence. Our benchmark produces highly repeatable results using a set of 60 English-language questions. We also provide open-source code for an automated benchmarking pipeline at https://github.com/EQ-bench/EQ-Bench and a leaderboard at https://eqbench.com.
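As an illustration of the correlation figure, the following minimal sketch computes a correlation coefficient between two lists of per-model scores. It assumes that the reported r is a Pearson coefficient, which the abstract does not state. The function name `pearson_r` and the score lists are hypothetical placeholders, not values from the paper or code from the EQ-Bench repository.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists (illustrative helper)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder scores for four models; NOT results from the paper.
eq_bench_scores = [30.0, 45.0, 55.0, 70.0]
mmlu_scores = [40.0, 52.0, 60.0, 75.0]

r = pearson_r(eq_bench_scores, mmlu_scores)
print(f"r = {r:.3f}")
```

Each list holds one score per model, in the same model order, so a value of r near 1 means that models ranking high on one benchmark tend to rank high on the other.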