Gesture-driven music generation is an emerging human-computer interaction paradigm for touch-free and expressive musical interaction. However, many existing approaches either treat the task as isolated gesture classification or map gestures to symbolic outputs such as MIDI that are rendered in a separate stage, which limits temporal continuity and real-time responsiveness. This work presents Gesture2Music, a low-latency streaming framework for continuous gesture-driven music generation from a live webcam feed. The system processes sequences of body and hand landmarks and uses a causal temporal convolutional network (TCN) to predict note-level musical control events, including pitch, octave, onset, sustain, amplitude, and activity state. Because available gesture-note datasets typically contain only isolated single-note recordings rather than continuous performance sequences, a synthetic stream generation strategy is introduced that constructs continuous gesture streams by concatenating single-note clips and deriving heuristic temporal event labels. Temporal consistency and spectral proxy losses are further used to reduce prediction jitter and encourage audio-consistent outputs. During inference, predicted musical events are rendered into continuous music using predefined note samples, with rhythmic quantization and scale-constrained filtering for improved musical stability. Experiments on a custom gesture-to-music dataset with 21 gesture-note classes (seven tones at each of three pitch levels) demonstrate stable real-time performance, a low inference latency of 30\,ms, and improved temporal continuity.
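The synthetic stream generation step can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_stream`, the class encoding (`class_id = octave * 7 + tone`), the `onset_frames` parameter, and the rule that labels the first few frames of each clip as onset and the rest as sustain are all assumptions made for illustration. Amplitude is omitted because the abstract does not say how it is derived.

```python
from typing import Dict, List, Sequence, Tuple

# One frame of flattened landmark coordinates (hypothetical layout).
Frame = List[float]


def build_stream(
    clips: Sequence[Tuple[int, List[Frame]]],
    onset_frames: int = 2,
) -> Tuple[List[Frame], List[Dict[str, int]]]:
    """Concatenate single-note clips and derive per-frame event labels.

    Each clip is (class_id, frames). ASSUMPTION: class_id in 0..20 encodes
    octave * 7 + tone. ASSUMPTION: the first `onset_frames` frames of a clip
    are labelled as onset and the remaining frames as sustain.
    """
    stream: List[Frame] = []
    labels: List[Dict[str, int]] = []
    for class_id, frames in clips:
        octave, tone = divmod(class_id, 7)
        for i, frame in enumerate(frames):
            is_onset = i < onset_frames
            stream.append(frame)
            labels.append(
                {
                    "pitch": tone,
                    "octave": octave,
                    "onset": int(is_onset),
                    "sustain": int(not is_onset),
                    "active": 1,  # every frame of a note clip counts as active
                }
            )
    return stream, labels
```

Calling `build_stream([(0, clip_a), (15, clip_b)])` returns one frame list and one label list of equal length, so a causal model can be trained on the concatenated sequence with frame-aligned targets.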