We present IMDD-1M, the first large-scale Industrial Multimodal Defect Dataset, comprising 1,000,000 aligned image-text pairs and designed to advance multimodal learning for manufacturing and quality inspection. IMDD-1M contains high-resolution images of real-world defects spanning more than 60 material categories and more than 400 defect types, each accompanied by expert-verified annotations and fine-grained textual descriptions of defect location, severity, and contextual attributes. The dataset supports a wide range of applications, including classification, segmentation, retrieval, captioning, and generative modeling. Building on IMDD-1M, we train a diffusion-based vision-language foundation model from scratch, tailored specifically to industrial scenarios. The model serves as a generalizable foundation that can be adapted efficiently to specialized domains through lightweight fine-tuning. Using less than 5% of the task-specific data required by dedicated expert models, it achieves comparable performance. This result highlights the potential of data-efficient foundation model adaptation for industrial inspection and generation, and paves the way for scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence. Additional details and resources are available at https://ninaneon.github.io/projectpage/