Video generation has recently advanced rapidly, building on strong text-to-image generation techniques. In this work, we propose AtomoVideo, a high-fidelity framework for image-to-video generation. Through multi-granularity image injection, the generated video stays more faithful to the given image. In addition, thanks to high-quality datasets and training strategies, we achieve greater motion intensity while maintaining superior temporal consistency and stability. Our architecture extends flexibly to the video frame prediction task, enabling long-sequence prediction through iterative generation. Furthermore, owing to its adapter-based training design, our approach combines well with existing personalized models and controllable modules. In quantitative and qualitative evaluations, AtomoVideo achieves superior results compared to popular methods. More examples can be found on our project website: https://atomo-video.github.io/.