Existing gesture interfaces only work with a fixed set of gestures defined either by interface designers or by users themselves, which introduces learning or demonstration effort that diminishes their naturalness. Humans, on the other hand, understand free-form gestures by synthesizing the gesture, context, experience, and common sense. In this way, the user does not need to learn, demonstrate, or associate gestures. We introduce GestureGPT, a free-form hand gesture understanding framework that mimics the human gesture understanding process to enable a natural free-form gestural interface. Our framework leverages multiple Large Language Model agents to manage and synthesize gesture and context information, then infers the interaction intent by associating the gesture with an interface function. More specifically, our triple-agent framework includes a Gesture Description Agent that automatically segments gestures and formulates natural language descriptions of hand poses and movements based on hand landmark coordinates. The description is deciphered by a Gesture Inference Agent through self-reasoning and querying about the interaction context (e.g., interaction history, gaze data), which is managed by a Context Management Agent. Following iterative exchanges, the Gesture Inference Agent discerns the user's intent by grounding it to an interactive function. We validated our framework offline under two real-world scenarios: smart home control and online video streaming. The average zero-shot Top-1/Top-5 grounding accuracies are 44.79%/83.59% for smart home tasks and 37.50%/73.44% for video streaming tasks. We also provide an extensive discussion covering the rationale for model selection, generalizability, and future research directions toward a practical system.
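The triple-agent pipeline described above can be sketched as follows. This is a minimal illustrative mock, not the authors' implementation: the class and method names (`describe`, `query`, `infer`), the toy gaze-matching rule, and the stubbed description are all assumptions standing in for the LLM calls the real framework makes.

```python
from dataclasses import dataclass, field

@dataclass
class GestureDescriptionAgent:
    """Turns hand-landmark coordinates into a natural-language description."""
    def describe(self, landmarks):
        # A real system would segment the gesture and verbalize pose and
        # motion via an LLM; a canned description stands in here.
        return "index finger extended, hand swipes left"

@dataclass
class ContextManagementAgent:
    """Holds interaction context (history, gaze) and answers queries."""
    context: dict = field(default_factory=dict)
    def query(self, question):
        return self.context.get(question, "unknown")

@dataclass
class GestureInferenceAgent:
    """Grounds a gesture description to an interface function through
    iterative rounds of context querying (self-reasoning is elided)."""
    functions: list
    def infer(self, description, context_agent, max_rounds=3):
        for _ in range(max_rounds):
            gaze = context_agent.query("gaze_target")
            # Toy grounding rule: prefer the function the user is gazing at.
            for fn in self.functions:
                if gaze in fn:
                    return fn
        return self.functions[0]  # fall back to the top-ranked candidate

ctx = ContextManagementAgent(context={"gaze_target": "volume"})
desc = GestureDescriptionAgent().describe(landmarks=[(0.1, 0.2, 0.0)])
intent = GestureInferenceAgent(
    functions=["volume_down", "next_video", "pause"]
).infer(desc, ctx)
print(intent)  # volume_down
```

The structure mirrors the abstract's division of labor: description, context management, and inference are separate agents that exchange messages, with grounding emerging from their iteration rather than from a fixed gesture-to-command mapping.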