Current gesture recognition systems primarily focus on identifying gestures within a predefined set, leaving a gap in connecting these gestures to interactive GUI elements or system functions (e.g., linking a 'thumb-up' gesture to a 'like' button). We introduce GestureGPT, a novel zero-shot gesture understanding and grounding framework that leverages large language models (LLMs). Gesture descriptions are formulated from hand landmark coordinates in gesture videos and fed into our dual-agent dialogue system. A gesture agent interprets these descriptions and asks about the interaction context (e.g., interface, history, gaze data), which a context agent organizes and provides. After iterative exchanges, the gesture agent infers the user's intent and grounds it to an interactive function. We validated the gesture description module on public first-view and third-view gesture datasets and tested the whole system in two real-world settings: video streaming and smart home IoT control. The highest zero-shot Top-5 grounding accuracies are 80.11% for video streaming and 90.78% for smart home tasks, showing the potential of this new gesture understanding paradigm.
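
The dual-agent exchange described above can be illustrated with a minimal sketch. This is not the GestureGPT implementation: both agents would be backed by LLMs in the actual framework, whereas here they are replaced by simple rule-based placeholders. All names (`GestureAgent`, `ContextAgent`, `ground`), the context keys, and the word-overlap scoring are hypothetical and chosen only to show the control flow: the gesture agent requests context, the context agent supplies what it has, and after a few rounds the gesture agent returns a ranked Top-k list of candidate functions.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ContextAgent:
    """Hypothetical context agent: holds context and answers queries."""
    context: dict  # e.g. {"interface": [...], "history": [...], "gaze": "..."}

    def answer(self, keys):
        # Return only the requested context entries that are available.
        return {k: self.context[k] for k in keys if k in self.context}


@dataclass
class GestureAgent:
    """Hypothetical gesture agent; an LLM would do the reasoning in practice."""
    description: str
    known: dict = field(default_factory=dict)

    def next_query(self):
        # Ask for every context type that has not been received yet.
        wanted = ["interface", "history", "gaze"]
        return [k for k in wanted if k not in self.known]

    def rank_functions(self, top_k=5):
        # Placeholder scoring (an assumption, not the paper's method):
        # word overlap with the gesture description, plus a bonus when
        # the function matches the gaze target.
        words = set(re.findall(r"[a-z]+", self.description.lower()))
        gaze = self.known.get("gaze")

        def score(function):
            overlap = len(words & set(function.lower().split()))
            return overlap + (1 if function == gaze else 0)

        candidates = self.known.get("interface", [])
        # sorted() is stable, so ties keep their interface order.
        return sorted(candidates, key=lambda f: -score(f))[:top_k]


def ground(description, context, max_rounds=3, top_k=5):
    """Run the query/answer loop, then return the Top-k candidate functions."""
    gesture_agent = GestureAgent(description)
    context_agent = ContextAgent(context)
    for _ in range(max_rounds):
        query = gesture_agent.next_query()
        if not query:
            break  # the gesture agent has everything it asked for
        reply = context_agent.answer(query)
        if not reply:
            break  # the context agent has nothing more to offer
        gesture_agent.known.update(reply)
    return gesture_agent.rank_functions(top_k)


if __name__ == "__main__":
    context = {
        "interface": ["like", "pause", "volume up", "next video"],
        "gaze": "like",
    }
    print(ground("thumb up, other fingers curled", context, top_k=2))
```

Under these toy scoring rules, the 'thumb-up' example ranks the 'like' function first because the gaze target breaks the tie with 'volume up'. A Top-5 evaluation would count a trial as correct whenever the intended function appears anywhere in the returned list.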