DiT4DiT: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control

Vision-Language-Action (VLA) models have emerged as a promising paradigm for robot learning, but their representations are still largely inherited from static image-text pretraining, leaving physical dynamics to be learned from comparatively limited action data. Generative video models, by contrast, encode rich spatiotemporal structure and implicit physics, making them a compelling foundation for robotic manipulation, yet their potential remains underexplored in the literature. To bridge this gap, we introduce DiT4DiT, an end-to-end Video-Action Model that couples a video Diffusion Transformer with an action Diffusion Transformer in a unified cascaded framework. Instead of relying on reconstructed future frames, DiT4DiT extracts intermediate denoising features from the video generation process and uses them as temporally grounded conditions for action prediction. We further propose a dual flow-matching objective with decoupled timesteps and noise scales for video prediction, hidden-state extraction, and action inference, enabling coherent joint training of both modules. Across simulation and real-world benchmarks, DiT4DiT achieves state-of-the-art results, reaching average success rates of 98.6% on LIBERO and 50.8% on RoboCasa GR1 while using substantially less training data. On the Unitree G1 robot, it also delivers superior real-world performance and strong zero-shot generalization. Importantly, DiT4DiT improves sample efficiency by over 10x and speeds up convergence by up to 7x, demonstrating that video generation can serve as an effective scaling proxy for robot policy learning. We release code and models at https://dit4dit.github.io/.
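
To make the idea of a dual flow-matching objective with decoupled timesteps concrete, the following is a minimal toy sketch, not the released implementation. It assumes a linear interpolation path between noise and data, uniformly sampled timesteps, a fixed timestep for hidden-state extraction, and a simple weighted sum of the two losses; none of these choices is specified in the abstract. The functions `toy_video_model` and `toy_action_model` are hypothetical stand-ins for the video and action Diffusion Transformers.

```python
import random

def interpolate(noise, data, t):
    """Linear flow-matching path (assumed): x_t = (1 - t) * noise + t * data."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]

def velocity_target(noise, data):
    """Target velocity along the linear path: data - noise."""
    return [d - n for n, d in zip(noise, data)]

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

def toy_video_model(x_t, t):
    """Hypothetical stand-in for the video DiT: returns (velocity, hidden features)."""
    hidden = [0.5 * x + t for x in x_t]   # pretend intermediate denoising features
    velocity = [-h for h in hidden]       # pretend velocity prediction
    return velocity, hidden

def toy_action_model(a_t, t, cond):
    """Hypothetical stand-in for the action DiT, conditioned on video features."""
    c = sum(cond) / len(cond)             # pool the conditioning features
    return [0.1 * a + c - t for a in a_t]

def dual_flow_matching_loss(video, action, rng, t_extract=0.5, action_weight=1.0):
    """Toy joint loss with separate timesteps for video, extraction, and action."""
    video_noise = [rng.gauss(0.0, 1.0) for _ in video]
    extract_noise = [rng.gauss(0.0, 1.0) for _ in video]
    action_noise = [rng.gauss(0.0, 1.0) for _ in action]

    # Decoupled timesteps: video and action are sampled independently (assumption).
    t_video = rng.random()
    t_action = rng.random()

    # Video branch: regress the velocity of the noisy video latent.
    v_pred, _ = toy_video_model(interpolate(video_noise, video, t_video), t_video)
    video_loss = mse(v_pred, velocity_target(video_noise, video))

    # Hidden-state extraction at its own fixed timestep (assumption).
    _, hidden = toy_video_model(interpolate(extract_noise, video, t_extract), t_extract)

    # Action branch: regress the action velocity, conditioned on video features.
    a_pred = toy_action_model(interpolate(action_noise, action, t_action), t_action, hidden)
    action_loss = mse(a_pred, velocity_target(action_noise, action))

    total = video_loss + action_weight * action_loss
    return total, video_loss, action_loss

if __name__ == "__main__":
    rng = random.Random(0)
    total, v_loss, a_loss = dual_flow_matching_loss([0.2, -0.1, 0.4], [1.0, -1.0], rng)
    print(total, v_loss, a_loss)
```

In this sketch the action branch never sees a reconstructed frame, only the intermediate features returned by the video model, which mirrors the conditioning described above. In a real system the toy functions would be replaced by trainable networks and the loss would be minimized by gradient descent.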
