We introduce ChatPose, a framework that employs Large Language Models (LLMs) to understand and reason about 3D human poses from images or textual descriptions. Our work is motivated by the human ability to intuitively understand a posture from a single image or a brief description, a process that intertwines image interpretation, world knowledge, and an understanding of body language. Traditional human pose estimation and generation methods often operate in isolation and lack semantic understanding and reasoning abilities. ChatPose addresses these limitations by embedding SMPL poses as distinct signal tokens within a multimodal LLM, enabling the direct generation of 3D body poses from both textual and visual inputs. Leveraging the capabilities of multimodal LLMs, ChatPose unifies classical 3D human pose estimation and generation tasks while supporting user interaction. Additionally, ChatPose lets LLMs apply their extensive world knowledge when reasoning about human poses, which leads to two advanced tasks: speculative pose generation and reasoning about pose estimation. These tasks require reasoning about humans in order to generate 3D poses from subtle text queries, possibly accompanied by images. We establish benchmarks for these tasks, moving beyond traditional 3D pose generation and estimation methods. Our results show that ChatPose outperforms existing multimodal LLMs and task-specific methods on these newly proposed tasks. Furthermore, ChatPose's ability to understand and generate 3D human poses based on complex reasoning opens new directions in human pose analysis.