In this study, we explore Transformer-based diffusion models for image and video generation. Although Transformer architectures dominate many fields thanks to their flexibility and scalability, visual generation, and diffusion-based models in particular, still relies primarily on CNN-based U-Net architectures. To address this gap, we introduce GenTron, a family of Generative models employing Transformer-based diffusion. Our first step is to adapt Diffusion Transformers (DiTs) from class conditioning to text conditioning, which involves a thorough empirical exploration of the conditioning mechanism. We then scale GenTron from approximately 900M to over 3B parameters and observe significant improvements in visual quality. Furthermore, we extend GenTron to text-to-video generation, incorporating novel motion-free guidance to enhance video quality. In human evaluations against SDXL, GenTron achieves a 51.1% win rate in visual quality (with a 19.8% draw rate) and a 42.3% win rate in text alignment (with a 42.9% draw rate). GenTron also excels on T2I-CompBench, underscoring its strengths in compositional generation. We believe this work will provide meaningful insights and serve as a valuable reference for future research.