Emotion recognition based on body movements is vital in human-computer interaction. However, existing emotion recognition methods predominantly focus on improving classification accuracy and rarely provide textual explanations that justify their classifications. In this paper, we propose an Emotion-Action Interpreter powered by a Large Language Model (EAI-LLM), which not only recognizes emotions but also generates textual explanations by treating 3D body movement data as unique input tokens within large language models (LLMs). Specifically, we propose a multi-granularity skeleton tokenizer designed for LLMs that separately extracts spatio-temporal tokens and semantic tokens from the skeleton data. This approach allows LLMs to generate more nuanced classification descriptions while maintaining robust classification performance. Furthermore, we treat the skeleton sequence as a distinct language and propose a unified skeleton token module. This module leverages the extensive background knowledge and language-processing capabilities of LLMs to address the challenges of joint training on heterogeneous datasets, thereby significantly improving recognition accuracy on individual datasets. Experimental results demonstrate that our model achieves recognition accuracy comparable to that of existing methods. More importantly, with the support of the background knowledge embedded in LLMs, our model can generate detailed emotion descriptions from its classification results, even when trained on a limited amount of labeled skeleton data.
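To make the multi-granularity tokenization idea concrete, the sketch below illustrates one plausible reading of it: a skeleton sequence of shape (frames, joints, 3) is mapped to a handful of spatio-temporal tokens (one per short temporal window) plus a single semantic token (a global summary of the whole motion), which could then be fed to an LLM alongside text tokens. All names, the window size, and the random linear projection standing in for a learned embedding are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def tokenize_skeleton(seq, window=4, embed_dim=8, rng=None):
    """Hypothetical multi-granularity skeleton tokenizer sketch.

    seq: (T, J, 3) array of 3D joint positions over T frames and J joints.
    Returns a (n_windows + 1, embed_dim) token matrix:
    spatio-temporal tokens (one per window) followed by one semantic token.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    T, J, C = seq.shape
    # Stand-in for a learned linear projection into the LLM's token space.
    proj = rng.standard_normal((J * C, embed_dim))

    # Spatio-temporal tokens: mean-pool each window of frames, then project.
    n_win = T // window
    windows = seq[: n_win * window].reshape(n_win, window, J * C).mean(axis=1)
    st_tokens = windows @ proj                                    # (n_win, embed_dim)

    # Semantic token: global average over all frames, then project.
    sem_token = (seq.reshape(T, J * C).mean(axis=0) @ proj)[None]  # (1, embed_dim)

    return np.concatenate([st_tokens, sem_token], axis=0)

# Example: 16 frames of a 17-joint skeleton yield 4 spatio-temporal
# tokens plus 1 semantic token.
tokens = tokenize_skeleton(np.zeros((16, 17, 3)))
print(tokens.shape)  # (5, 8)
```

In a real system the pooling and projection would be learned modules (e.g. graph or transformer encoders), but the two-granularity output, fine-grained temporal tokens plus a coarse semantic token, is the structural point the abstract describes.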