We present Paris 2.0, the first video generation model pre-trained through decentralized computation. Its training recipe builds on Paris 1.0 (arXiv:2510.03434), the first open-weight Decentralized Diffusion Model (DDM), which showed that an image generation model can be trained without a monolithic GPU cluster. Temporally coherent video generation, however, had remained an open problem under decentralized training, and Paris 2.0 solves it. In low-resolution text-to-video training, compared with a monolithic model trained on the same data under a matched total compute budget, Paris 2.0 cuts Fréchet Video Distance (FVD) from 561.04 to 279.01, a roughly 2.0x reduction, and improves both CLIP text-video similarity and aesthetic score.