Expressive co-speech gestures are crucial for natural human-robot interaction, but generating them on physical humanoid robots is difficult because gesture strokes must align with speech emphasis while satisfying strict kinematic and dynamic constraints. Unlike virtual avatars, humanoid robots cannot freely execute rapid or overlapping motions, which makes word-level synchronization and hardware-safe motion planning a coupled problem. We present \textbf{WaveSync}, a hybrid framework in which a Large Language Model decomposes dialogue responses into structured semantic schemas and assigns per-word importance weights, from which a continuous Semantic Importance Wave is constructed. Gesture trajectories are shaped through Dynamic Movement Primitives, which enforce kinematic feasibility while enhancing expressiveness. A Wavefront Optimization stage enforces peak-to-peak gesture-speech synchronization and resolves residual kinematic violations through gesture-duration compression and forward propagation. Experiments on five dialogue scenarios show that our method achieves high synchronization accuracy and outperforms three baselines in both objective and subjective evaluations. Each component of WaveSync plays a necessary role in producing gestures that are expressive, semantically grounded, and kinematically compliant. The code, resources, and videos are available at \href{https://github.com/pairs-lab/WaveSync}{WaveSync}.
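To make the idea of a continuous importance wave concrete, the following is a minimal sketch and not the WaveSync implementation. It assumes each word comes with a start time, an end time, and an importance weight, and it places a Gaussian bump at each word's midpoint. The Gaussian kernel, the function names `importance_wave` and `peak_time`, and the parameters `sample_rate` and `sigma` are all illustrative assumptions; the abstract does not specify how the wave is built.

```python
import math


def importance_wave(words, sample_rate=50.0, sigma=0.12):
    """Build a sampled importance curve from per-word weights.

    words: list of (start_s, end_s, weight) tuples (assumed input format).
    Returns (times, values), sampled at `sample_rate` Hz.
    """
    if not words:
        return [], []
    duration = max(end for _, end, _ in words)
    n = int(round(duration * sample_rate)) + 1
    times = [i / sample_rate for i in range(n)]
    values = []
    for t in times:
        v = 0.0
        for start, end, weight in words:
            center = 0.5 * (start + end)
            # Assumed kernel: a Gaussian bump centred on the word midpoint.
            v += weight * math.exp(-0.5 * ((t - center) / sigma) ** 2)
        values.append(v)
    return times, values


def peak_time(times, values):
    """Return the time of the highest sample in the curve."""
    i = max(range(len(values)), key=values.__getitem__)
    return times[i]


if __name__ == "__main__":
    # Hypothetical word timings and weights for a three-word utterance.
    words = [(0.0, 0.3, 0.1), (0.3, 0.8, 0.9), (0.8, 1.0, 0.2)]
    times, values = importance_wave(words)
    print("peak near", peak_time(times, values), "s")
```

In this sketch, the peak of the curve falls near the midpoint of the most heavily weighted word, which is the kind of target a gesture stroke would be aligned to.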