We introduce X&Fuse, a general approach for conditioning on visual information when generating images from text. We demonstrate the potential of X&Fuse in three different text-to-image generation scenarios. (i) When a bank of images is available, we retrieve and condition on a related image (Retrieve&Fuse), resulting in significant improvements on the MS-COCO benchmark and a state-of-the-art FID score of 6.65 in zero-shot settings. (ii) When cropped-object images are at hand, we use them to perform subject-driven generation (Crop&Fuse), outperforming the textual inversion method while being more than 100x faster. (iii) When oracle access to the image scene is available (Scene&Fuse), we achieve an FID score of 5.03 on MS-COCO in zero-shot settings. Our experiments indicate that X&Fuse is an effective, simple, general, and easy-to-adapt approach for scenarios in which the model may benefit from additional visual information.