Current gesture interfaces typically require users to learn and perform gestures from a predefined set, which leads to a less natural experience. Interfaces supporting user-defined gestures eliminate the learning process, but users still need to demonstrate gestures and associate them with specific system functions themselves. We introduce GestureGPT, a free-form hand gesture understanding framework that does not require users to learn, demonstrate, or associate gestures. Our framework leverages the common-sense knowledge and strong inference abilities of large language models (LLMs) to understand a spontaneously performed gesture from its natural language description and automatically map it to a function provided by the interface. More specifically, our triple-agent framework involves a Gesture Description Agent that automatically segments gestures and formulates natural language descriptions of hand poses and movements from hand landmark coordinates. The description is deciphered by a Gesture Inference Agent through self-reasoning and querying about the interaction context (e.g., interaction history, gaze data), which a Context Management Agent organizes and provides. Following iterative exchanges, the Gesture Inference Agent discerns the user's intent and grounds it to an interactive function. We validated our conceptual framework in two real-world scenarios: smart home control and online video streaming. The average zero-shot Top-5 grounding accuracy is 83.59% for smart home tasks and 73.44% for video streaming. We also provide an extensive discussion of our framework, including the model selection rationale, the quality of generated descriptions, and generalizability.
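The triple-agent pipeline described above can be sketched as a simple message-passing loop. This is a minimal, hypothetical illustration only: all class names, the placeholder description logic, and the gaze-based ranking are assumptions standing in for the LLM-backed agents and prompts of the actual system.

```python
# Hypothetical sketch of GestureGPT's triple-agent loop. The real agents are
# LLM-driven; here each is stubbed with placeholder rules for illustration.

from dataclasses import dataclass, field


@dataclass
class ContextManagementAgent:
    """Organizes interaction context (e.g., interaction history, gaze data)."""
    context: dict = field(default_factory=dict)

    def query(self, key: str):
        return self.context.get(key, "unknown")


class GestureDescriptionAgent:
    """Turns hand landmark coordinates into a natural language description."""

    def describe(self, landmarks: list) -> str:
        # Placeholder: a real system would segment the gesture and verbalize
        # hand pose and movement from the landmark trajectory.
        return "index finger extended, moving left to right"


class GestureInferenceAgent:
    """Queries context iteratively, then grounds the gesture to a function."""

    def ground(self, description: str, ctx: ContextManagementAgent,
               functions: list, max_rounds: int = 3) -> list:
        for _ in range(max_rounds):
            # Self-reasoning step stubbed as a single context query.
            gaze = ctx.query("gaze_target")
            if gaze != "unknown":
                break
        # Placeholder ranking: prefer functions mentioning the gaze target,
        # returning a Top-5 candidate list as in the paper's evaluation.
        ranked = sorted(functions, key=lambda f: gaze not in f)
        return ranked[:5]


landmarks = [(0.1, 0.2, 0.0)] * 21  # dummy 21-point hand landmarks
ctx = ContextManagementAgent({"gaze_target": "volume slider"})
desc = GestureDescriptionAgent().describe(landmarks)
top5 = GestureInferenceAgent().ground(
    desc, ctx,
    ["play/pause", "volume slider up", "mute", "next video", "seek"])
print(top5[0])  # the gaze-matched candidate ranks first in this stub
```

The key design point the sketch mirrors is the separation of concerns: description generation never sees the function list, and inference never sees raw landmarks, communicating only through natural language and context queries.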