Facial expression recognition (FER) is an important research topic in emotional artificial intelligence, and researchers have made remarkable progress in recent decades. However, current FER paradigms face challenges in generalization, lack semantic information aligned with natural language, and struggle to process both images and videos within a unified framework, making them difficult to apply in multimodal emotion understanding and human-computer interaction. Multimodal Large Language Models (MLLMs) have recently achieved notable success and offer advantages that could address these issues and overcome the limitations of current FER paradigms. However, directly applying pre-trained MLLMs to FER still faces several challenges: our zero-shot evaluations of existing open-source MLLMs on FER reveal a significant performance gap compared to GPT-4V and current supervised state-of-the-art (SOTA) methods. In this paper, we aim to enhance MLLMs' capability to understand facial expressions. We first generate instruction data for five FER datasets with Gemini. We then propose a novel MLLM, named EMO-LLaMA, which incorporates facial priors from a pretrained facial analysis network to enrich human facial information. Specifically, we design a Face Info Mining module to extract both global and local facial information. Additionally, we use a handcrafted prompt to introduce age-gender-race attributes, accounting for emotional differences across demographic groups. Extensive experiments show that EMO-LLaMA achieves SOTA-comparable or competitive results on both static and dynamic FER datasets. The instruction dataset and code are available at https://github.com/xxtars/EMO-LLaMA.