Leveraging Speech for Gesture Detection in Multimodal Communication

Gestures are inherent to human interaction and often complement speech in face-to-face communication, forming a multimodal communication system. An important task in gesture analysis is detecting where a gesture begins and ends. Research on automatic gesture detection has primarily relied on visual and kinematic information to detect a limited set of isolated or silent gestures with low variability, neglecting the integration of speech and vision signals needed to detect gestures that co-occur with speech. This work addresses that gap by focusing on co-speech gesture detection, emphasising the synchrony between speech and co-speech hand gestures. We address three main challenges: the variability of gesture forms, the temporal misalignment between gesture and speech onsets, and the difference in sampling rate between modalities. To handle the temporal misalignment and the sampling rate difference, we investigate extended speech time windows and employ a separate backbone model for each modality. We use Transformer encoders in cross-modal and early fusion techniques to align and integrate speech and skeletal sequences. Our results show that combining visual and speech information significantly enhances gesture detection performance. Extending the speech buffer beyond the visual time segment improves performance, and multimodal integration with cross-modal and early fusion outperforms unimodal and late fusion baselines. Additionally, we find a correlation between the models' gesture prediction confidence and low-level speech frequency features potentially associated with gestures. Overall, the study provides a better understanding of co-speech gestures and improved methods for detecting them, facilitating the analysis of multimodal communication.
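The idea of a speech buffer that extends beyond the visual time segment, combined with modalities sampled at different rates, can be illustrated with a small index calculation. The following is a minimal sketch and not the implementation used in this work: the function name `extended_speech_window`, the frame rate, the audio sample rate, and the buffer length are all hypothetical values chosen for illustration.

```python
def extended_speech_window(start_frame, num_frames, fps, sample_rate,
                           buffer_s, total_samples):
    """Map a visual segment to a wider range of audio samples.

    All parameter names and values are illustrative assumptions.
    Returns (start_sample, end_sample) with end_sample exclusive.
    """
    # Time span covered by the visual (skeletal) segment, in seconds.
    t_start = start_frame / fps
    t_end = (start_frame + num_frames) / fps

    # Widen the span on both sides so that speech starting before or
    # after the gesture onset still falls inside the audio window.
    t_start -= buffer_s
    t_end += buffer_s

    # Convert seconds to sample indices at the audio rate and clip
    # to the bounds of the recording.
    start_sample = max(0, round(t_start * sample_rate))
    end_sample = min(total_samples, round(t_end * sample_rate))
    return start_sample, end_sample


# Hypothetical setup: 30 fps skeletal data, 16 kHz audio, 10 s recording.
window = extended_speech_window(start_frame=30, num_frames=15, fps=30,
                                sample_rate=16000, buffer_s=0.25,
                                total_samples=160000)
print(window)
```

In this sketch the visual segment and the audio window are kept as separate index ranges, so each could be passed to its own backbone model before fusion; the two sequences do not need to share a sampling rate.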
