We present aMUSEd, an open-source, lightweight masked image model (MIM) for text-to-image generation based on MUSE. With 10 percent of MUSE's parameters, aMUSEd is focused on fast image generation. We believe MIM is under-explored compared to latent diffusion, the prevailing approach for text-to-image generation. Compared to latent diffusion, MIM requires fewer inference steps and is more interpretable. Additionally, MIM can be fine-tuned to learn additional styles with only a single image. We hope to encourage further exploration of MIM by demonstrating its effectiveness on large-scale text-to-image generation and releasing reproducible training code. We also release checkpoints for two models which directly produce images at 256x256 and 512x512 resolutions.
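To illustrate why a masked image model can get by with few inference steps, the sketch below computes a cosine unmasking schedule of the kind used in MaskGIT-style parallel decoding. It is only a minimal illustration under stated assumptions: the abstract does not say which schedule aMUSEd uses, and the function name `cosine_mask_schedule`, the token count (256) and the step count (12) are hypothetical values chosen for the example.

```python
import math


def cosine_mask_schedule(num_tokens: int, num_steps: int) -> list[int]:
    """Return how many image tokens are still masked after each decoding step.

    Assumption: a cosine schedule, as in MaskGIT-style decoders. Every step
    predicts all masked tokens in parallel, keeps the most confident ones,
    and re-masks the rest, so the masked count shrinks to zero in num_steps.
    """
    remaining = []
    for step in range(1, num_steps + 1):
        # Fraction of tokens left masked after this step.
        ratio = math.cos(math.pi / 2 * step / num_steps)
        remaining.append(math.floor(num_tokens * ratio))
    # Force the final step to unmask everything, independent of float rounding.
    remaining[-1] = 0
    return remaining


if __name__ == "__main__":
    # Hypothetical setting: 256 latent tokens decoded in 12 steps.
    for step, masked in enumerate(cosine_mask_schedule(256, 12), start=1):
        print(f"step {step:2d}: {masked:3d} tokens still masked")
```

Because every step fills in many tokens at once, the number of forward passes equals the number of steps rather than the number of tokens, which is the intuition behind the abstract's claim about inference cost.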