We introduce Vidu, a high-performance text-to-video generator that can produce 1080p videos up to 16 seconds long in a single generation. Vidu is a diffusion model with U-ViT as its backbone, which provides the scalability and the capability to handle long videos. Vidu exhibits strong coherence and dynamism, can generate both realistic and imaginative videos, and understands some professional photography techniques, on par with Sora, the most powerful reported text-to-video generator. Finally, we perform initial experiments on other controllable video generation tasks, including canny-to-video generation, video prediction, and subject-driven generation, which demonstrate promising results.