SCRIPT: Scalable Diffusion Policy with Multi-stage Training for Language-driven Physics-based Humanoid Control

Controlling physics-based humanoids from natural-language instructions is a critical step toward general-purpose embodied agents. However, existing methods remain constrained by a tension between semantic expressiveness and physical feasibility, and often fail to jointly achieve faithful instruction following, high-quality motion, and stable long-horizon control. We propose SCRIPT, a scalable diffusion policy with a multi-stage training framework for language-driven physics-based humanoid control. The core of SCRIPT is a Joint Action-State-Text Diffusion Transformer (JAST-DiT), which represents actions, physical states, and text as dedicated token streams and couples them through joint attention, enabling direct interaction between language semantics and control dynamics. To stabilize autoregressive control, we introduce a nonlinear history conditioning mechanism, which preserves dense recent context and samples increasingly sparse cues from long-term history. Beyond supervised imitation pre-training, we propose a post-training stage, Reinforcement Learning with Hybrid Rewards (RLHR), that further improves performance. By injecting learnable noise into the flow-sampling process, RLHR improves motion quality and instruction following in closed-loop simulation, using hybrid rewards that combine physical feedback with text rewards. Quantitative evaluations demonstrate that SCRIPT outperforms prior state-of-the-art methods, with gains across text alignment, motion quality, and physical realism metrics. Furthermore, scaling studies on the 1200-hour MotionMillion dataset show consistent performance gains as model size increases, highlighting SCRIPT's scalability for large-scale pre-training. Our code will be made publicly available for future research.
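To make the idea of nonlinear history conditioning concrete, the following is a minimal sketch of one way to pick past timesteps so that recent context is dense and older context is increasingly sparse. The abstract does not specify the actual sampling schedule; the geometric spacing, the function name `history_indices`, and the parameters `dense`, `sparse`, and `base` are all assumptions made here for illustration.

```python
def history_indices(t, dense=4, sparse=4, base=2):
    """Pick past timestep indices to condition on at step t.

    Hypothetical schedule (not taken from the paper): keep the `dense`
    most recent steps, then add `sparse` older steps whose spacing
    grows geometrically by a factor of `base`.
    """
    # Dense part: offsets 1, 2, ..., dense (the most recent steps).
    offsets = list(range(1, dense + 1))

    # Sparse part: each new offset is further back than the last,
    # and the gap between consecutive offsets is multiplied by `base`.
    gap = base
    last = dense
    for _ in range(sparse):
        last += gap
        offsets.append(last)
        gap *= base

    # Drop offsets that would reach before the start of the episode.
    return [t - o for o in offsets if t - o >= 0]


if __name__ == "__main__":
    print(history_indices(100))
    print(history_indices(5))
```

Under these assumed defaults, eight history tokens cover a window of 34 past steps, whereas a uniformly dense history of the same length would cover only eight. Early in an episode the function returns whatever history exists.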