Diffusion models have recently been employed to generate high-quality images, reducing the need for manual data collection and improving model generalization in tasks such as object detection, instance segmentation, and image perception. However, synthetic-data frameworks are usually designed with meticulous human effort for each task, owing to varying requirements on image layout, content, and annotation format, which restricts the application of synthetic data to more general scenarios. In this paper, we propose AnySynth, a unified framework that integrates adaptable, comprehensive, and highly controllable components capable of generating an arbitrary type of synthetic data under diverse requirements. Specifically, a Task-Specific Layout Generation Module is first introduced to produce reasonable layouts for different tasks by leveraging the generation ability of large language models and the layout priors of real-world images. A Uni-Controlled Image Generation Module is then developed to create high-quality, controllable synthetic images conditioned on the generated layouts. In addition, user-specified reference images and style images can be incorporated into the generation process according to task requirements. Finally, a Task-Oriented Annotation Module provides precise and detailed annotations for the generated images across different tasks. We have validated our framework's performance across various tasks, including Few-shot Object Detection, Cross-domain Object Detection, Zero-shot Composed Image Retrieval, and Multi-modal Image Perception and Grounding. The task-specific data synthesized by our framework significantly improves model performance on these tasks, demonstrating the generality and effectiveness of our framework.