Emotional intelligence significantly shapes our daily behaviors and interactions. Although Large Language Models (LLMs) are increasingly viewed as a stride toward artificial general intelligence and exhibit impressive performance on numerous tasks, it remains uncertain whether they can genuinely grasp psychological emotional stimuli. Understanding and responding to emotional cues gives humans a distinct advantage in problem-solving. In this paper, we take the first step towards exploring the ability of LLMs to understand emotional stimuli. To this end, we first conduct automatic experiments on 45 tasks using various LLMs, including Flan-T5-Large, Vicuna, Llama 2, BLOOM, ChatGPT, and GPT-4. Our tasks span deterministic and generative applications that represent comprehensive evaluation scenarios. The automatic experiments show that LLMs have a grasp of emotional intelligence and that their performance can be improved with emotional prompts, which we call "EmotionPrompt": the original prompt combined with emotional stimuli. For example, we observe an 8.00% relative performance improvement in Instruction Induction and 115% in BIG-Bench. In addition to these deterministic tasks, which can be evaluated automatically using existing metrics, we conduct a human study with 106 participants to assess the quality of generative tasks using both vanilla and emotional prompts. The human study results demonstrate that EmotionPrompt significantly boosts the performance of generative tasks (10.9% average improvement in terms of performance, truthfulness, and responsibility metrics). We provide an in-depth discussion of why EmotionPrompt works for LLMs and the factors that may influence its performance. We posit that EmotionPrompt heralds a novel avenue for exploring interdisciplinary knowledge for human-LLM interaction.
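As a rough illustration of the idea described above, the following minimal sketch shows one way an emotional stimulus could be combined with an original prompt, and how a relative performance improvement could be computed. It assumes the stimulus is simply appended to the prompt. The function names `emotion_prompt` and `relative_improvement`, the example task, the stimulus sentence, and the scores are all hypothetical and chosen for illustration only; they are not taken from the experiments reported here.

```python
def emotion_prompt(original_prompt: str, stimulus: str) -> str:
    """Combine a prompt with an emotional stimulus.

    Assumption: the stimulus is appended after the original prompt,
    separated by a single space.
    """
    return f"{original_prompt.strip()} {stimulus.strip()}"


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Hypothetical task prompt and stimulus, for illustration only.
    vanilla = "Determine whether a movie review is positive or negative."
    stimulus = "This is very important to my career."
    print(emotion_prompt(vanilla, stimulus))

    # Hypothetical scores: vanilla prompt vs. prompt with stimulus.
    gain = relative_improvement(baseline=0.50, improved=0.54)
    print(f"relative improvement: {gain:.2f}%")
```

Note that a relative improvement is measured against the baseline score rather than as an absolute difference, so a figure above 100% means the score more than doubled.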