StyleGANs are at the forefront of controllable image generation because they produce a semantically disentangled latent space, which makes them well suited to image editing and manipulation. However, the performance of StyleGANs degrades severely when they are trained with class-conditioning on large-scale long-tailed datasets. We find that one reason for this degradation is the collapse of the latents for each class in the $\mathcal{W}$ latent space. With NoisyTwins, we first introduce an effective and inexpensive augmentation strategy for class embeddings, and then decorrelate the latents based on self-supervision in the $\mathcal{W}$ space. This decorrelation mitigates collapse, so that our method preserves intra-class diversity while maintaining class-consistency in image generation. We show the effectiveness of our approach on the large-scale, real-world long-tailed datasets ImageNet-LT and iNaturalist 2019, where our method outperforms other methods by $\sim 19\%$ on FID, establishing a new state-of-the-art.
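The two ingredients named above, augmenting class embeddings and decorrelating latents in $\mathcal{W}$, can be illustrated with a minimal NumPy sketch. The abstract does not specify the noise distribution or the form of the self-supervised objective, so the sketch assumes additive Gaussian noise and a Barlow Twins-style cross-correlation loss. The function names, the noise scale `sigma`, and the weight `lam` are hypothetical and chosen only for illustration.

```python
import numpy as np


def augment_class_embedding(emb, sigma, rng):
    """Return a noisy copy of class embeddings.

    Assumption: additive isotropic Gaussian noise with scale `sigma`.
    """
    return emb + sigma * rng.standard_normal(emb.shape)


def decorrelation_loss(w_a, w_b, lam=5e-3, eps=1e-8):
    """Barlow Twins-style loss between two batches of latents, shape (N, D).

    Assumption: this stands in for the self-supervised objective; it pulls
    the cross-correlation matrix of the two views toward the identity.
    """
    n = w_a.shape[0]
    # Standardize each latent dimension across the batch.
    z_a = (w_a - w_a.mean(axis=0)) / (w_a.std(axis=0) + eps)
    z_b = (w_b - w_b.mean(axis=0)) / (w_b.std(axis=0) + eps)
    # (D, D) cross-correlation matrix between the two views.
    c = z_a.T @ z_b / n
    diag = np.diag(c)
    on_diag = np.sum((diag - 1.0) ** 2)          # views should agree per dimension
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)  # dimensions should be decorrelated
    return on_diag + lam * off_diag
```

In a training loop, the two views would come from passing the same noise vector with two differently augmented copies of a class embedding through the mapping network. Under this loss, latents whose dimensions all carry the same signal (a collapsed representation) score worse than latents with independent dimensions.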