We introduce Wuerstchen, a novel technique for text-to-image synthesis that combines competitive performance with unprecedented cost-effectiveness and ease of training on constrained hardware. Building on recent advancements in machine learning, our approach applies latent diffusion strategies at strong latent image compression rates, which significantly reduces the computational burden typically associated with state-of-the-art models while preserving, if not enhancing, the quality of the generated images. Wuerstchen also achieves notable speed improvements at inference time, making real-time applications more viable. A key advantage of our method is its modest training requirement of only 9,200 GPU hours, which cuts the usual costs significantly without compromising final performance. In a comparison against state-of-the-art models, we found the approach to be strongly competitive. This paper opens the door to a new line of research that prioritizes both performance and computational accessibility, thereby democratizing the use of sophisticated AI technologies. With Wuerstchen, we demonstrate a compelling step forward in text-to-image synthesis and offer an innovative path for future research.