VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents

Recent progress in video-to-video (V2V) translation has enabled realistic resimulation of embodied AI demonstrations, a capability that lets pretrained robot policies transfer to new environments without additional data collection. However, prior work operates on only a single view at a time, whereas embodied AI tasks are commonly captured from multiple synchronized cameras to support policy learning. Naively applying a single-view model independently to each camera yields inconsistent appearance across views, and standard transformer architectures do not scale to multi-view settings because of the quadratic cost of cross-view attention. We present VideoWeaver, the first multimodal multi-view V2V translation framework. VideoWeaver is first trained as a single-view flow-based V2V model. To extend it to the multi-view regime, we ground all views in a shared 4D latent space derived from a feed-forward spatial foundation model, Pi3. This encourages view-consistent appearance even under wide baselines and dynamic camera motion. To scale beyond a fixed number of cameras, we train views at distinct diffusion timesteps, which lets the model learn both joint and conditional view distributions. This in turn allows autoregressive synthesis of new viewpoints conditioned on existing ones. Experiments show performance on par with or better than the state of the art on single-view translation benchmarks and, for the first time, physically and stylistically consistent multi-view translations, including the challenging egocentric and heterogeneous-camera setups central to world randomization for robot learning.
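
As a rough illustration of what training views at distinct timesteps could look like, the sketch below draws an independent timestep for each view and noises each view's latent accordingly. It is a minimal sketch under stated assumptions, not the VideoWeaver implementation: the uniform timestep sampling, the linear noise-to-data path (t = 0 is pure noise, t = 1 is clean data), the `clean_prob` parameter, and all function names are hypothetical choices made for illustration, since the abstract does not specify them.

```python
import random


def sample_view_timesteps(num_views, rng, clean_prob=0.0):
    """Draw one timestep per view (assumed convention: t = 1.0 means clean).

    With probability `clean_prob` (a hypothetical knob) a view is kept clean,
    so it acts as a conditioning view instead of a view to be denoised.
    """
    timesteps = []
    for _ in range(num_views):
        if rng.random() < clean_prob:
            timesteps.append(1.0)
        else:
            timesteps.append(rng.random())  # uniform in [0, 1)
    return timesteps


def interpolate(data, noise, t):
    """Assumed linear path between noise (t = 0) and data (t = 1)."""
    return [(1.0 - t) * n + t * d for d, n in zip(data, noise)]


def noisy_views(latents, rng, clean_prob=0.0):
    """Noise every view's latent at its own independently drawn timestep."""
    timesteps = sample_view_timesteps(len(latents), rng, clean_prob)
    noised = []
    for latent, t in zip(latents, timesteps):
        noise = [rng.gauss(0.0, 1.0) for _ in latent]
        noised.append(interpolate(latent, noise, t))
    return noised, timesteps


if __name__ == "__main__":
    rng = random.Random(0)
    # Three toy "views", each a tiny flattened latent.
    latents = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    noised, timesteps = noisy_views(latents, rng, clean_prob=0.3)
    for t, view in zip(timesteps, noised):
        print(f"t={t:.2f} -> {view}")
```

Under these assumptions, a batch in which every view receives a noisy timestep exercises the joint distribution over views, while a batch in which some views are held at t = 1 exercises the conditional distribution of the remaining views given the clean ones. The latter is the situation that arises when new viewpoints are synthesized autoregressively from views that already exist.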
