We present SciDraw-6K, a curated dataset of 6,291 scientific illustrations synthesized by Google Gemini image-generation models, each paired with prompts in eleven languages (English, Simplified Chinese, Traditional Chinese, Japanese, Korean, German, French, Spanish, Brazilian Portuguese, Italian, and Russian). Images span eight broad scientific categories -- biomedical, chemistry, materials, electronics, environment, AI systems, physics, and a long "other" tail -- and are produced primarily by the gemini-2.5-flash-image and gemini-3-pro-image-preview model families. In contrast to general-purpose text-to-image corpora that dominate the literature, SciDraw-6K is purpose-built for the scientific illustration genre: schematic diagrams, mechanism figures, table-of-contents graphics, and conceptual posters. We describe the construction pipeline, report dataset statistics, and document its use as the substrate of sci-draw.com, a public scientific drawing service. The dataset is released to support multilingual text-to-image research, domain-adapted diffusion fine-tuning, and prompt-engineering studies for scientific visualization.
Dataset: https://huggingface.co/datasets/SciDrawAI/SciDraw-6K
Code: https://github.com/SciDrawAI/scidraw-6k