Text-to-Image (T2I) models have recently made notable progress at 1K and 2K resolutions. Driven by the demand for a better visual experience and by rapid advances in imaging technology, interest in Ultra-High-Resolution (UHR) image generation has grown significantly. However, UHR image generation remains highly challenging because high-resolution content is both scarce and complex. In this paper, we first introduce PixVerve-95K, a high-quality, open-source UHR T2I dataset curated with a carefully designed data pipeline. It contains 95K images across diverse scenarios, each with at least 100M pixels, along with seven-dimensional annotations. Building on this large-scale image-text dataset, we take a pioneering step toward extending various T2I foundation models to native 100MP generation using three training schemes. Finally, we propose PixVerve-Bench, a benchmark that combines conventional metrics with multimodal large language model-based assessments to establish a comprehensive evaluation protocol for UHR images, covering visual quality and semantic alignment. Extensive experimental results on our benchmark, together with our exploration of training strategies, provide valuable insights for future research.