SciDraw-6K: A Multilingual Scientific Illustration Dataset Generated by Google Gemini

We present SciDraw-6K, a curated dataset of 6,291 scientific illustrations synthesized by Google Gemini image-generation models, each paired with prompts in eleven languages (English, Simplified Chinese, Traditional Chinese, Japanese, Korean, German, French, Spanish, Brazilian Portuguese, Italian, and Russian). Images span eight broad scientific categories -- biomedical, chemistry, materials, electronics, environment, AI systems, physics, and a long "other" tail -- and are produced primarily by the gemini-2.5-flash-image and gemini-3-pro-image-preview model families. In contrast to general-purpose text-to-image corpora that dominate the literature, SciDraw-6K is purpose-built for the scientific illustration genre: schematic diagrams, mechanism figures, table-of-contents graphics, and conceptual posters. We describe the construction pipeline, report dataset statistics, and document its use as the substrate of sci-draw.com, a public scientific drawing service. The dataset is released to support multilingual text-to-image research, domain-adapted diffusion fine-tuning, and prompt-engineering studies for scientific visualization.

Dataset: https://huggingface.co/datasets/SciDrawAI/SciDraw-6K
Code: https://github.com/SciDrawAI/scidraw-6k
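
To make the structure described in the abstract concrete, here is a minimal sketch of how a single record might be modeled and checked: one image, one of eight categories, one generating model, and a prompt in each of the eleven languages. The abstract does not specify the dataset's schema, so the class name `Record`, its field names, the language codes, the category labels, and the example file path are all hypothetical choices made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field

# Assumed language tags for the eleven languages listed in the abstract.
# The dataset's actual column names or codes may differ.
LANGUAGES = (
    "en", "zh-Hans", "zh-Hant", "ja", "ko",
    "de", "fr", "es", "pt-BR", "it", "ru",
)

# Assumed labels for the eight categories named in the abstract.
CATEGORIES = (
    "biomedical", "chemistry", "materials", "electronics",
    "environment", "ai_systems", "physics", "other",
)


@dataclass
class Record:
    """Hypothetical record layout: one illustration plus its prompts."""
    image_path: str
    category: str
    model: str
    prompts: dict = field(default_factory=dict)  # language tag -> prompt text

    def validate(self) -> None:
        """Raise ValueError if the category is unknown or a prompt is missing."""
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        missing = [lang for lang in LANGUAGES if lang not in self.prompts]
        if missing:
            raise ValueError(f"missing prompts for: {missing}")


def category_counts(records):
    """Count records per category, e.g. for a dataset-statistics table."""
    return Counter(r.category for r in records)


# Placeholder record; the path and prompt text are made up for the example.
demo = Record(
    image_path="images/000001.png",
    category="chemistry",
    model="gemini-2.5-flash-image",
    prompts={lang: f"[{lang}] placeholder prompt" for lang in LANGUAGES},
)
demo.validate()
```

A check like `validate` is one way to confirm that every image really carries all eleven prompts before using the data for multilingual training or evaluation; the field names would need to be adapted to whatever schema the released dataset actually uses.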

