Because they are difficult to scale up, generative adversarial networks (GANs) appear to be losing ground in text-conditioned image synthesis. Sparsely activated mixture-of-experts (MoE) models have recently been shown to be an effective way to train large-scale models with limited computational resources. Inspired by this idea, we present Aurora, a GAN-based text-to-image generator that employs a collection of experts to learn feature processing, together with a sparse router that selects the most suitable expert for each feature point. To faithfully decode the sampling stochasticity and the text condition into the final synthesis, the router makes its decision adaptively, taking into account the text-integrated global latent code. At 64×64 image resolution, our model trained on LAION2B-en and COYO-700M achieves a zero-shot FID of 6.2 on MS COCO. We release the code and checkpoints to facilitate further development by the community.
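To make the routing idea concrete, the following is a minimal sketch of a sparse MoE layer in which each feature point is sent to a single expert and the router also sees a global latent code. It is not the released implementation. The class name `SparseMoELayer`, the top-1 selection, the concatenation of feature and latent code as router input, and the use of plain linear maps as experts are all assumptions made for illustration.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights; stands in for learned parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class SparseMoELayer:
    """Hypothetical top-1 MoE layer conditioned on a global latent code."""

    def __init__(self, feat_dim, latent_dim, num_experts, seed=0):
        rng = random.Random(seed)
        # Assumption: each expert is a single linear map over one feature point.
        self.experts = [
            random_matrix(feat_dim, feat_dim, rng) for _ in range(num_experts)
        ]
        # Assumption: the router scores experts from [feature ; latent code].
        self.router = random_matrix(num_experts, feat_dim + latent_dim, rng)

    def route(self, feature, latent):
        """Return (expert index, gate weight) for one feature point."""
        probs = softmax(matvec(self.router, feature + latent))
        index = max(range(len(probs)), key=probs.__getitem__)
        return index, probs[index]

    def forward(self, features, latent):
        """Process every feature point with its selected expert only."""
        outputs, choices = [], []
        for feature in features:
            index, gate = self.route(feature, latent)
            expert_out = matvec(self.experts[index], feature)
            # Scale by the gate so the router would receive gradient in training.
            outputs.append([gate * x for x in expert_out])
            choices.append(index)
        return outputs, choices


# Usage: 5 feature points of width 4, a latent code of width 3, 2 experts.
rng = random.Random(1)
features = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]
latent = [rng.gauss(0.0, 1.0) for _ in range(3)]
layer = SparseMoELayer(feat_dim=4, latent_dim=3, num_experts=2)
outputs, choices = layer.forward(features, latent)
```

Because the latent code is part of the router input, the same feature point can be dispatched to different experts under different prompts or noise samples, which is the adaptive behaviour the abstract describes. Only one expert runs per feature point, so compute per point stays constant as the number of experts grows.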