Humans intuitively move to sound, but current humanoid robots lack expressive, improvisational capability, remaining confined to predefined motions or sparse commands. Existing approaches first generate motion from audio and then retarget it to the robot; this reliance on explicit motion reconstruction introduces cascaded errors, high latency, and a disjointed mapping from acoustics to actuation. We propose RoboPerform, the first unified audio-to-locomotion framework that directly generates music-driven dance and speech-driven co-speech gestures from audio. Guided by the core principle of "motion = content + style", the framework treats audio as an implicit style signal and eliminates the need for explicit motion reconstruction. RoboPerform integrates a ResMoE teacher policy that adapts to diverse motion patterns with a diffusion-based student policy that injects audio style. This retargeting-free design ensures low latency and high fidelity. Experimental validation shows that RoboPerform achieves promising results in physical plausibility and audio alignment, transforming robots into responsive performers that react to audio.
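To make the "motion = content + style" principle concrete, the following is a minimal sketch (not the authors' implementation) of a policy that treats proprioception as the content stream and injects audio features as a style signal via FiLM-style modulation. All module names, feature dimensions, and the conditioning scheme are assumptions for illustration; the actual RoboPerform student is diffusion-based and distilled from a ResMoE teacher, which this sketch does not reproduce.

```python
# Hypothetical sketch of audio-as-style conditioning; not the RoboPerform code.
import torch
import torch.nn as nn

class AudioStylePolicy(nn.Module):
    def __init__(self, proprio_dim=48, audio_dim=128, hidden_dim=256, action_dim=29):
        super().__init__()
        # "Content" branch: encodes robot state (joint positions, velocities, ...).
        self.content_enc = nn.Sequential(
            nn.Linear(proprio_dim, hidden_dim), nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ELU(),
        )
        # "Style" branch: encodes audio features (e.g., a mel-spectrogram or beat
        # embedding) into a scale and shift that modulate the content features.
        self.style_enc = nn.Sequential(
            nn.Linear(audio_dim, hidden_dim), nn.ELU(),
            nn.Linear(hidden_dim, 2 * hidden_dim),
        )
        self.head = nn.Linear(hidden_dim, action_dim)

    def forward(self, proprio, audio_feat):
        content = self.content_enc(proprio)
        scale, shift = self.style_enc(audio_feat).chunk(2, dim=-1)
        styled = content * (1 + scale) + shift   # motion = content + (audio) style
        return self.head(styled)                 # joint-space action targets

# Example: one control step with random inputs.
policy = AudioStylePolicy()
action = policy(torch.randn(1, 48), torch.randn(1, 128))
print(action.shape)  # torch.Size([1, 29])
```

Because the audio enters only as a modulation of the content features, no intermediate human motion needs to be reconstructed or retargeted, which is the property the abstract highlights for low-latency execution.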