Modern text-to-image synthesis models have achieved an exceptional level of photorealism, generating high-quality images from arbitrary text descriptions. In light of this synthesis ability, several studies have shown promising results in exploiting generated data for image recognition. However, directly applying existing approaches to data-hungry real-world situations (e.g., few-shot or long-tailed scenarios) yields only marginal performance gains, because they fail to fully reflect the distribution of the real data. This paper proposes a new image synthesis pipeline for long-tailed situations using Textual Inversion and evaluates it through extensive experiments. The study demonstrates that images generated from textual-inverted text tokens align effectively with the real domain, significantly enhancing the recognition ability of a standard ResNet50 backbone. We also show that real-world data imbalance can be successfully mitigated by filling up the imbalanced data with synthetic images. In conjunction with techniques from long-tailed recognition, our method achieves state-of-the-art results on standard long-tailed benchmarks when trained from scratch.
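As a rough illustration of what "filling up the imbalanced data with synthetic images" could involve, the sketch below computes how many synthetic images each class would need to reach a common size. The abstract does not state the fill target or any function names; the helper `synthetic_budget`, the default of filling every class up to the largest class, and the toy labels are all assumptions made for this example, not the paper's actual procedure.

```python
from collections import Counter


def synthetic_budget(labels, target=None):
    """Return {class: number of synthetic images needed} so that every
    class reaches `target` samples.

    Assumption (not from the abstract): when `target` is None, classes
    are filled up to the size of the largest (head) class.
    """
    counts = Counter(labels)
    if target is None:
        target = max(counts.values())
    # Classes already at or above the target need no synthetic images.
    return {cls: max(target - n, 0) for cls, n in counts.items()}


# Toy long-tailed label list: one head class and two tail classes.
labels = ["dog"] * 100 + ["cat"] * 20 + ["lynx"] * 5
budget = synthetic_budget(labels)
```

In a pipeline like the one described, each nonzero entry of `budget` would be the number of images to request from the text-to-image model for that class, presumably using that class's textual-inverted token in the prompt.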