Amazing Combinatorial Creation: Acceptable Swap-Sampling for Text-to-Image Generation

Building a machine learning system that generates meaningful combinatorial object images from multiple textual descriptions, emulating human creativity, is a significant challenge: humans can construct amazing combinatorial objects, whereas machines only strive to emulate the data distribution. In this paper, we develop a straightforward yet highly effective technique called acceptable swap-sampling, which generates a combinatorial object image that exhibits novelty and surprise from the text concepts of different objects. First, we propose a swapping mechanism that constructs a novel embedding by exchanging column vectors of two text embeddings, and we use this embedding to generate a new combinatorial image with a cutting-edge diffusion model. Second, we design an acceptable region by keeping suitable CLIP distances between the new image and the images generated from the original concepts, which increases the likelihood of accepting a new image with a high-quality combination. This region allows us to efficiently sample a small subset from a pool of new images generated by randomly exchanging column vectors. Finally, we employ a segmentation method to compare CLIP distances among the segmented components and select the most promising object image from the sampled subset. Our experiments focus on text pairs of objects from ImageNet, and the results demonstrate that our approach outperforms recent methods such as Stable-Diffusion2, DALLE2, ERNIE-ViLG2 and Bing in generating novel and surprising object images, even when the paired concepts appear implausible, such as lionfish-abacus. Moreover, during the sampling process, our approach, which uses neither training nor human preference data, is comparable to PickScore and HPSv2, which are trained on human preference datasets.
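
To make the swapping and acceptance steps concrete, the following is a minimal sketch and not the paper's implementation. It assumes each text embedding is a NumPy array of shape (tokens, dim), that "exchanging column vectors" means picking whole columns from one embedding or the other under a random binary mask, and that the acceptable region can be approximated by two hypothetical thresholds, `d_max` and `balance_tol`, on cosine distances between image features. The function names, the thresholds, and the 0.5 swap probability are illustrative assumptions; the abstract does not specify them.

```python
import numpy as np


def swap_columns(emb_a, emb_b, mask):
    """Take column j from emb_b where mask[j] is True, otherwise from emb_a."""
    if emb_a.shape != emb_b.shape:
        raise ValueError("embeddings must have the same shape")
    if mask.shape != (emb_a.shape[1],):
        raise ValueError("mask needs one entry per column")
    # Broadcast the per-column mask over all rows (tokens).
    return np.where(mask[None, :], emb_b, emb_a)


def random_swap_pool(emb_a, emb_b, n_candidates, rng):
    """Build a pool of swapped embeddings from random column masks.

    Assumption: each column is swapped independently with probability 0.5.
    """
    masks = rng.random((n_candidates, emb_a.shape[1])) < 0.5
    pool = [swap_columns(emb_a, emb_b, m) for m in masks]
    return pool, masks


def cosine_distance(u, v):
    """1 - cosine similarity, used here as a stand-in for a CLIP distance."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def in_acceptable_region(d_a, d_b, d_max, balance_tol):
    """Hypothetical acceptance rule (an assumption, not the paper's definition).

    Accept a new image when it stays within d_max of both original concept
    images and the two distances differ by at most balance_tol.
    """
    return d_a <= d_max and d_b <= d_max and abs(d_a - d_b) <= balance_tol
```

In use, each embedding in the pool would be passed to a diffusion model, the resulting image encoded with CLIP, and `in_acceptable_region` applied to its distances from the two original concept images. Those model calls are omitted here because they depend on the specific generation pipeline.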
