Recent advances in robot foundation models trained on large-scale human teleoperation data have enabled robots to perform increasingly complex real-world tasks. However, scaling these systems remains difficult because collecting task-specific demonstrations is expensive and labor-intensive. Synthetic data, especially generated videos, offer a promising direction, but existing World Models (WMs) are not directly suitable for policy learning since they do not provide paired action trajectories. World-Action (WA) models partially address this by predicting actions alongside visual outputs, yet they often lack strong video-action alignment, while two-stage pipelines that generate video first and then infer actions introduce inefficiency and error accumulation. To address these limitations, we propose VAG, a unified flow-matching-based dual-stream framework that jointly generates video and action under visual and language conditioning. By synchronizing denoising in both branches and using an adaptive 3D pooling mechanism to transfer compact global video context to the action branch, VAG improves cross-modal consistency during generation. Across both simulated and real-world settings, VAG produces aligned video-action pairs with competitive prediction quality, supports executable trajectory replay, and provides useful synthetic pretraining data that improves downstream policy generalization, indicating its potential as a practical world-action model for embodied data synthesis.
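To make the mechanism concrete, the following is a minimal sketch of synchronized dual-stream sampling, not the VAG implementation. The abstract does not specify the architecture, so everything here is an assumption made for illustration: the function names (`adaptive_pool_3d`, `toward`, `sample_jointly`) are hypothetical, the learned velocity networks are replaced by a toy field that points straight at a known target, "adaptive 3D pooling" is read as adaptive average pooling to a fixed 2 x 2 x 2 grid, and integration uses plain Euler steps. Language and image conditioning are omitted.

```python
import random


def adaptive_pool_3d(video, shape, out_shape):
    """Average-pool a flat T*H*W latent down to a fixed-size context vector.

    Bin edges follow the floor/ceil rule commonly used for adaptive average
    pooling, so the output length is fixed regardless of the input size.
    """
    T, H, W = shape

    def bins(size, n):
        # Bin i covers [floor(i*size/n), ceil((i+1)*size/n)).
        return [(i * size // n, -(-(i + 1) * size // n)) for i in range(n)]

    pooled = []
    for t0, t1 in bins(T, out_shape[0]):
        for h0, h1 in bins(H, out_shape[1]):
            for w0, w1 in bins(W, out_shape[2]):
                cell = [video[(t * H + h) * W + w]
                        for t in range(t0, t1)
                        for h in range(h0, h1)
                        for w in range(w0, w1)]
                pooled.append(sum(cell) / len(cell))
    return pooled


def toward(x, target, t):
    """Toy velocity field (assumption): stands in for a learned network."""
    return [(g - v) / (1.0 - t) for v, g in zip(x, target)]


def sample_jointly(video_target, shape, action_dim, steps=10, seed=0):
    """Denoise video and action latents in lockstep from Gaussian noise."""
    rng = random.Random(seed)
    video = [rng.gauss(0.0, 1.0) for _ in video_target]
    action = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    dt = 1.0 / steps
    for k in range(steps):
        t = k / steps  # both branches share the same flow time
        # Compact global context from the current video latent.
        ctx = adaptive_pool_3d(video, shape, (2, 2, 2))
        ctx_mean = sum(ctx) / len(ctx)
        # Hypothetical action head: the target depends only on the context.
        action_target = [ctx_mean * (j + 1) for j in range(action_dim)]
        v_video = toward(video, video_target, t)
        v_action = toward(action, action_target, t)
        # One Euler step per branch, taken at the same time index.
        video = [x + dt * v for x, v in zip(video, v_video)]
        action = [a + dt * v for a, v in zip(action, v_action)]
    return video, action
```

Calling `sample_jointly([i / 64 for i in range(64)], (4, 4, 4), action_dim=3)` returns a 64-value video latent and a 3-value action vector. The point of the sketch is the loop structure: the action update at each step reads a pooled summary of the video latent at that same step, rather than waiting for a finished video as a two-stage pipeline would.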