Humans intuitively move to sound, but current humanoid robots lack expressive improvisational capabilities, remaining confined to predefined motions or sparse commands. Existing approaches generate motion from audio and then retarget it to robots, relying on explicit motion reconstruction that leads to cascaded errors, high latency, and a disjointed acoustic-to-actuation mapping. We propose RoboPerform, the first unified audio-to-locomotion framework that directly generates music-driven dance and co-speech gestures from audio. Guided by the core principle of "motion = content + style", the framework treats audio as an implicit style signal and eliminates the need for explicit motion reconstruction. RoboPerform integrates a ResMoE teacher policy for adapting to diverse motion patterns with a diffusion-based student policy for audio style injection. This retargeting-free design ensures low latency and high fidelity. Experiments show that RoboPerform achieves promising physical plausibility and audio alignment, transforming humanoid robots into responsive performers that react to audio.
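To make the "motion = content + style" teacher-student design concrete, the sketch below is a minimal, hypothetical PyTorch illustration, not the authors' implementation: a residual mixture-of-experts teacher produces reference actions from proprioception, and a diffusion-style student denoises actions while conditioning on audio features as a style signal, so no intermediate human motion is reconstructed. All module names, dimensions, and the single-step denoising-distillation loss are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): ResMoE-style teacher policy
# plus a diffusion-style student that injects audio features as style conditioning.
import torch
import torch.nn as nn


class ResMoETeacher(nn.Module):
    """Hypothetical residual mixture-of-experts policy: a shared base action plus
    a softmax-gated residual produced by several expert MLPs."""

    def __init__(self, obs_dim: int, act_dim: int, n_experts: int = 4, hidden: int = 256):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU(), nn.Linear(hidden, act_dim))
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU(), nn.Linear(hidden, act_dim))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(obs_dim, n_experts)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.gate(obs), dim=-1)                    # (B, E) expert weights
        residual = torch.stack([e(obs) for e in self.experts], -1)   # (B, A, E)
        return self.base(obs) + (residual * w.unsqueeze(1)).sum(-1)  # (B, A)


class DiffusionStudent(nn.Module):
    """Hypothetical student denoiser: predicts clean actions from a noised action,
    the diffusion timestep, proprioceptive state, and audio "style" features."""

    def __init__(self, obs_dim: int, act_dim: int, audio_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim + obs_dim + audio_dim + 1, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, noisy_act, t, obs, audio_feat):
        t = t.float().unsqueeze(-1) / 1000.0                         # crude scalar timestep embedding
        return self.net(torch.cat([noisy_act, obs, audio_feat, t], dim=-1))


if __name__ == "__main__":
    B, obs_dim, act_dim, audio_dim = 8, 48, 23, 64
    teacher = ResMoETeacher(obs_dim, act_dim)
    student = DiffusionStudent(obs_dim, act_dim, audio_dim)

    obs = torch.randn(B, obs_dim)
    audio = torch.randn(B, audio_dim)                                # e.g. a music/speech embedding
    target = teacher(obs).detach()                                   # teacher supervision

    # One denoising-distillation step: corrupt the teacher's actions and ask the
    # student to recover them, conditioned on observation + audio style.
    t = torch.randint(0, 1000, (B,))
    noisy = target + 0.1 * torch.randn_like(target)
    loss = nn.functional.mse_loss(student(noisy, t, obs, audio), target)
    loss.backward()
    print("distillation loss:", float(loss))
```

Under these assumptions, the student alone runs at deployment time, mapping audio and proprioception directly to actions, which is what keeps the pipeline retargeting-free and low-latency.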