Recent advances in multimodal models have demonstrated remarkable text-guided image editing capabilities, with systems like GPT-4o and Nano-Banana setting new benchmarks. However, the research community's progress remains constrained by the absence of large-scale, high-quality, openly accessible datasets built from real images. We introduce Pico-Banana-400K, a comprehensive 400K-image dataset for instruction-based image editing. The dataset is constructed by using Nano-Banana to generate diverse edit pairs from real photographs in the OpenImages collection. What distinguishes Pico-Banana-400K from previous synthetic datasets is our systematic approach to quality and diversity. We employ a fine-grained image editing taxonomy to ensure comprehensive coverage of edit types, and we maintain precise content preservation and instruction faithfulness through MLLM-based quality scoring and careful curation. Beyond single-turn editing, Pico-Banana-400K enables research into complex editing scenarios. The dataset includes three specialized subsets: (1) a 72K-example multi-turn collection for studying sequential editing, reasoning, and planning across consecutive modifications; (2) a 56K-example preference subset for alignment research and reward model training; and (3) paired long and short editing instructions for developing instruction rewriting and summarization capabilities. By providing this large-scale, high-quality, and task-rich resource, Pico-Banana-400K establishes a robust foundation for training and benchmarking the next generation of text-guided image editing models.