Diffusion-based models have shown great potential in generating high-quality images with various layouts, which can benefit downstream perception tasks. However, fully automatic layout generation driven only by language, and a suitable metric for evaluating images that contain multiple generated instances, have not been well explored. In this work, we present Auto Cherry-Picker (ACP), a novel framework that generates high-quality multi-modal training examples to augment perception and multi-modal training. Starting with a simple list of natural language concepts, we prompt large language models (LLMs) to generate detailed descriptions and design reasonable layouts. Next, we use an off-the-shelf text-to-image model to generate multiple images. The generated data are then filtered with a comprehensively designed metric to ensure quality. In particular, we present a new metric, Composite Layout and Image Score (CLIS), to evaluate the generated images fairly. By customizing the initial concept list, our synthetic high-quality examples boost performance in various scenarios, especially in addressing challenges associated with long-tailed distributions and imbalanced datasets. Experimental results on downstream tasks demonstrate that Auto Cherry-Picker can significantly improve the performance of existing models. In addition, we have thoroughly investigated the correlation between CLIS and performance gains in downstream tasks, and we find that a better CLIS corresponds to better downstream performance. This finding highlights the potential role of evaluation metrics in various visual perception and multi-modal large language model (MLLM) tasks. Code will be available.
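The selection step described above (generate several candidates per concept list, score each, keep the best) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Candidate` fields, the `composite_score` function, and the equal weighting are hypothetical stand-ins, not the definition of CLIS or the ACP implementation, neither of which is specified in this abstract.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Candidate:
    """One generated sample (hypothetical structure, for illustration only)."""
    image_id: str
    layout_score: float  # assumed to lie in [0, 1]; higher is better
    image_score: float   # assumed to lie in [0, 1]; higher is better


def composite_score(c: Candidate, layout_weight: float = 0.5) -> float:
    """Hypothetical stand-in for a composite metric.

    Combines the two scores as a convex combination. The actual CLIS
    formula is not given in the abstract; this weighting is an assumption.
    """
    if not 0.0 <= layout_weight <= 1.0:
        raise ValueError("layout_weight must be in [0, 1]")
    return layout_weight * c.layout_score + (1.0 - layout_weight) * c.image_score


def cherry_pick(
    candidates: Sequence[Candidate],
    threshold: float = 0.0,
    layout_weight: float = 0.5,
) -> Optional[Candidate]:
    """Return the highest-scoring candidate, or None if none reaches threshold."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: composite_score(c, layout_weight))
    if composite_score(best, layout_weight) < threshold:
        return None
    return best


# Toy candidates with made-up scores, standing in for several images
# generated from the same description and layout.
demo = [
    Candidate("img_a", layout_score=0.9, image_score=0.4),
    Candidate("img_b", layout_score=0.7, image_score=0.8),
    Candidate("img_c", layout_score=0.2, image_score=0.9),
]
picked = cherry_pick(demo, threshold=0.6)
print(picked.image_id if picked else "no candidate passed")
```

With these made-up scores, raising `threshold` above the best composite value makes `cherry_pick` return `None`, which corresponds to discarding the whole batch rather than keeping a low-quality sample.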