We provide a dataset to enable Deep Generative Models (DGMs) in engineering design and propose methods to automate data labeling using large-scale foundation models. GeoBiked is curated to contain 4,355 bicycle images annotated with structural and technical features, and is used to investigate two automated labeling techniques: using consolidated latent features (Hyperfeatures) from image-generation models to detect geometric correspondences (e.g., the position of the wheel center) in structural images, and generating diverse text descriptions for structural images. GPT-4o, a vision-language model (VLM), is instructed to analyze images and produce diverse descriptions aligned with the system prompt. Representing technical images as Diffusion-Hyperfeatures makes it possible to draw geometric correspondences between them, and presenting multiple annotated source images improves the detection accuracy of geometric points in unseen samples. GPT-4o is sufficiently capable of generating accurate descriptions of technical images: grounding the generation only on images yields diverse descriptions but causes hallucinations, grounding it only on categorical labels restricts diversity, and using both as input balances creativity and accuracy. The successful use of Hyperfeatures for geometric correspondence suggests that this approach can serve general point-detection and annotation tasks in technical images. Labeling such images with text descriptions using VLMs is possible, but depends on the model's detection capabilities, careful prompt engineering, and the selection of input information. Applying foundation models in engineering design is largely unexplored. We aim to bridge this gap with a dataset for exploring the training, finetuning, and conditioning of DGMs in this field, and by suggesting approaches to bootstrap foundation models to process technical images.
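To make the correspondence idea concrete, the following is a minimal sketch of how an annotated point can be transferred between images via nearest-neighbor matching in a per-pixel feature space. It assumes hyperfeature maps have already been extracted from a diffusion model (extraction is omitted); the function names and the averaging over multiple annotated sources are illustrative assumptions, not the paper's exact pipeline.

```python
# Sketch: point transfer via cosine similarity over per-pixel features.
# Assumes src_feats / tgt_feats are (H, W, C) hyperfeature maps already
# extracted from a diffusion model for the source and target images.
import numpy as np

def transfer_point(src_feats, tgt_feats, src_point):
    """Map an annotated (row, col) point from a source image to the
    target image by nearest-neighbor cosine similarity in feature space."""
    r, c = src_point
    query = src_feats[r, c]                           # (C,)
    flat = tgt_feats.reshape(-1, tgt_feats.shape[-1])  # (H*W, C)
    sims = flat @ query / (
        np.linalg.norm(flat, axis=1) * np.linalg.norm(query) + 1e-8
    )
    idx = int(np.argmax(sims))
    _, w = tgt_feats.shape[:2]
    return divmod(idx, w)  # (row, col) in the target feature map

def transfer_from_sources(sources, tgt_feats):
    """Average the predictions of several annotated source images; in the
    paper's setting, multiple sources improve accuracy on unseen samples."""
    preds = [transfer_point(f, tgt_feats, p) for f, p in sources]
    return tuple(np.mean(preds, axis=0))
```

In practice the feature maps are lower-resolution than the image, so the predicted location would be scaled back to pixel coordinates.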
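The finding that grounding GPT-4o on both the image and the categorical labels balances creativity and accuracy can be sketched as a request-construction step. The system prompt, label names, and helper below are illustrative assumptions, not the paper's exact setup; only the message format follows the OpenAI chat-completions vision API.

```python
# Sketch: assembling a GPT-4o request that grounds the description on
# both the image and categorical labels. Prompt wording and label keys
# are hypothetical placeholders.
import base64

SYSTEM_PROMPT = (
    "You are an engineering assistant. Describe the bicycle in the image "
    "in one varied, technically accurate sentence."
)

def build_messages(image_bytes, labels):
    """Assemble chat messages: the image plus categorical labels as text."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    label_text = ", ".join(f"{k}: {v}" for k, v in labels.items())
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "text",
             "text": f"Known labels for this image: {label_text}."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]},
    ]
```

The resulting messages would then be sent to the model, e.g. via `client.chat.completions.create(model="gpt-4o", messages=...)`. Dropping the label text from the user message reproduces the image-only condition that the paper finds more diverse but prone to hallucination.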