Recent advances in humanoid whole-body motion tracking have enabled the execution of diverse and highly coordinated motions on real hardware. However, existing controllers are commonly driven either by predefined motion trajectories, which offer limited flexibility when user intent changes, or by continuous human teleoperation, which requires constant human involvement and limits autonomy. This work addresses the problem of driving a universal humanoid controller in a real-time, interactive manner. We present TextOp, a real-time text-driven humanoid motion generation and control framework that supports streaming language commands and on-the-fly instruction modification during execution. TextOp adopts a two-level architecture in which a high-level autoregressive motion diffusion model continuously generates short-horizon kinematic trajectories conditioned on the current text input, while a low-level motion tracking policy executes these trajectories on a physical humanoid robot. By bridging interactive motion generation with robust whole-body control, TextOp unlocks free-form intent expression and enables smooth transitions across multiple challenging behaviors, such as dancing and jumping, within a single continuous motion execution. Extensive real-robot experiments and offline evaluations demonstrate instant responsiveness, smooth whole-body motion, and precise control. The project page and open-source code are available at https://text-op.github.io/
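The two-level architecture described above can be sketched as a minimal streaming control loop. This is an illustrative stub, not TextOp's implementation: `HighLevelGenerator` stands in for the autoregressive motion diffusion model (here a toy rule replaces the learned model), `LowLevelTracker` stands in for the tracking policy (here simple proportional tracking of a single scalar target), and all class and function names are hypothetical.

```python
from dataclasses import dataclass
from collections import deque


@dataclass
class TrajectoryChunk:
    # Short-horizon kinematic targets (toy: one scalar joint value per frame).
    frames: list


class HighLevelGenerator:
    """Stand-in for the high-level autoregressive motion diffusion model:
    emits a short-horizon chunk conditioned on the latest text command."""

    def __init__(self, horizon=4):
        self.horizon = horizon
        self._last = 0.0

    def generate(self, command: str) -> TrajectoryChunk:
        # Toy conditioning: "jump" ramps the target up, anything else decays it.
        step = 0.5 if command == "jump" else -0.1
        frames = []
        for _ in range(self.horizon):
            self._last = max(0.0, self._last + step)
            frames.append(self._last)
        return TrajectoryChunk(frames)


class LowLevelTracker:
    """Stand-in for the low-level motion tracking policy: consumes kinematic
    frames and produces actions (toy: proportional tracking of the target)."""

    def __init__(self, gain=0.8):
        self.gain = gain
        self.state = 0.0

    def step(self, target: float) -> float:
        action = self.gain * (target - self.state)
        self.state += action
        return action


def run(commands):
    """Streaming loop: each tick reads the newest command, regenerates a
    short-horizon chunk, and tracks it frame by frame, so an instruction
    change takes effect at the next chunk boundary."""
    gen, tracker = HighLevelGenerator(), LowLevelTracker()
    queue = deque(commands)
    states = []
    while queue:
        command = queue.popleft()       # on-the-fly instruction change
        chunk = gen.generate(command)   # high level: short-horizon plan
        for target in chunk.frames:     # low level: execute each frame
            tracker.step(target)
            states.append(tracker.state)
    return states
```

The key design point this sketch mirrors is that replanning happens at chunk granularity: the generator is reconditioned on the newest text every short horizon, so new intent is absorbed without restarting the whole motion.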