While humans naturally gesture during speech, only a sparse subset of these movements is visually depictive and semantically linked to specific spoken words. Current multimodal models struggle to capture these semantic co-speech gestures, largely because precisely annotated training data is scarce. To address this, we introduce the Gesture Recognition in the Wild (GRW) dataset, the first large-scale benchmark designed to map unconstrained human gestures to specific words with frame-accurate temporal boundaries. Comprising 156,688 manually annotated video clips, GRW spans a highly diverse 150-word taxonomy of physical actions, spatial descriptors, and abstract concepts. We use GRW to train video models to (a) classify gestures as semantic or not, (b) recognize the word corresponding to a co-speech gesture, and (c) temporally localize the gesture. We also use GRW to establish benchmarks for these three tasks.