This paper introduces a generative model for multimodal control of text-to-image foundation models such as Stable Diffusion, tailored to engineering design synthesis. The model provides parametric, image, and text control modalities to improve design precision and diversity. First, it handles both partial and complete parametric inputs through a diffusion model that acts as a design-autocomplete co-pilot, paired with a parametric encoder that processes the resulting parameters. Second, it uses assembly graphs to systematically assemble input component images, which a component encoder then processes to capture essential visual features. Third, textual descriptions are integrated via CLIP encoding to give a comprehensive interpretation of design intent. These diverse inputs are combined through a multimodal fusion technique into a joint embedding, which serves as the input to a module inspired by ControlNet. This integration allows the model to apply robust multimodal control to foundation models, enabling the generation of complex and precise engineering designs. The approach broadens the capabilities of AI-driven design tools and demonstrates that precise control grounded in diverse data modalities can substantially enhance design generation.
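The fusion step described above can be sketched minimally: each modality (parametric vector, component-image features, CLIP text features) is projected into a shared space and combined into one joint embedding that would condition a ControlNet-style module. All dimensions, encoder stand-ins, and the sum-based fusion below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

# Hypothetical embedding dimensions; the abstract does not specify them.
PARAM_DIM, IMG_DIM, TXT_DIM, JOINT_DIM = 16, 32, 24, 64

rng = np.random.default_rng(0)

# Stand-ins for the three modality encoders (parametric encoder,
# component-image encoder, CLIP text encoder): here just random
# linear projections into the shared joint space, for illustration.
W_param = rng.standard_normal((PARAM_DIM, JOINT_DIM))
W_img = rng.standard_normal((IMG_DIM, JOINT_DIM))
W_txt = rng.standard_normal((TXT_DIM, JOINT_DIM))

def fuse(param_vec, img_vec, txt_vec):
    """Project each modality embedding into the shared space and sum them.

    Summation is one simple fusion choice; the paper's actual fusion
    technique may differ (e.g. concatenation followed by an MLP, or
    cross-attention).
    """
    return param_vec @ W_param + img_vec @ W_img + txt_vec @ W_txt

# Dummy per-modality embeddings standing in for real encoder outputs.
joint = fuse(rng.standard_normal(PARAM_DIM),
             rng.standard_normal(IMG_DIM),
             rng.standard_normal(TXT_DIM))
print(joint.shape)  # (64,)
```

The joint embedding would then replace the single-modality conditioning signal that a standard ControlNet consumes, which is what lets one control branch serve all three input types.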