Current gesture recognition systems primarily focus on identifying gestures within a predefined set, leaving a gap in connecting these gestures to interactive GUI elements or system functions (e.g., linking a 'thumb-up' gesture to a 'like' button). We introduce GestureGPT, a novel zero-shot gesture understanding and grounding framework that leverages large language models (LLMs). Gesture descriptions are formulated from hand landmark coordinates in gesture videos and fed into our dual-agent dialogue system. A gesture agent interprets these descriptions and queries the interaction context (e.g., interface, history, gaze data), which a context agent organizes and provides. After iterative exchanges, the gesture agent infers the user's intent and grounds it to an interactive function. We validated the gesture description module on public first-view and third-view gesture datasets and tested the whole system in two real-world settings: video streaming and smart home IoT control. The highest zero-shot Top-5 grounding accuracies are 80.11% for video streaming and 90.78% for smart home tasks, showing the potential of this new gesture understanding paradigm.
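To make the idea of turning hand landmark coordinates into a textual gesture description concrete, the following is a minimal sketch and not the GestureGPT description module itself. It assumes a 21-point hand layout in the style of MediaPipe Hands (index 0 is the wrist, followed by four points per finger) and uses a simple, hypothetical heuristic: a finger counts as extended when its tip lies farther from the wrist than its middle joint. The names `FINGERS`, `finger_states`, and `describe` are illustrative only.

```python
import math

# Assumed landmark layout (MediaPipe Hands style): 0 = wrist, then 4 points per finger.
# Each entry maps a finger to (middle-joint index, tip index).
FINGERS = {
    "thumb": (3, 4),
    "index": (6, 8),
    "middle": (10, 12),
    "ring": (14, 16),
    "pinky": (18, 20),
}


def finger_states(landmarks):
    """Label each finger 'extended' or 'bent' from 21 (x, y) or (x, y, z) points.

    Heuristic (an assumption for illustration): a finger is extended when its
    tip is farther from the wrist than its middle joint.
    """
    if len(landmarks) != 21:
        raise ValueError("expected 21 hand landmarks")
    wrist = landmarks[0]
    states = {}
    for name, (mid, tip) in FINGERS.items():
        extended = math.dist(landmarks[tip], wrist) > math.dist(landmarks[mid], wrist)
        states[name] = "extended" if extended else "bent"
    return states


def describe(landmarks):
    """Render the finger states as a short text description for an LLM prompt."""
    states = finger_states(landmarks)
    return "; ".join(f"{name}: {state}" for name, state in states.items())
```

A description string of this kind (for example, one reporting an extended thumb and four bent fingers) is the sort of input a gesture agent could reason over; a real system would likely also describe palm orientation, finger proximity, and motion over time.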