Deep generative models, such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Diffusion Models, and Transformers, have shown great promise in a variety of applications, including image and speech synthesis, natural language processing, and drug discovery. However, when these models are applied to engineering design problems, evaluating their performance can be challenging, because traditional likelihood-based statistical metrics may not fully capture the requirements of engineering applications. This paper doubles as a review and a practical guide to evaluation metrics for deep generative models (DGMs) in engineering design. We first summarize well-accepted "classic" evaluation metrics for deep generative models, which are grounded in machine learning theory and typical computer science applications. Using case studies, we then highlight why these metrics seldom translate well to design problems yet see frequent use due to the lack of established alternatives. Next, we curate a set of design-specific metrics that have been proposed across different research communities and can be used to evaluate deep generative models. These metrics focus on requirements unique to design and engineering, such as constraint satisfaction, functional performance, novelty, and conditioning. We structure our review and discussion as a set of practical selection criteria and usage guidelines. Throughout our discussion, we apply the metrics to models trained on simple two-dimensional example problems. Finally, to illustrate the selection process and classic usage of the presented metrics, we evaluate three deep generative models on a multifaceted bicycle frame design problem considering performance target achievement, design novelty, and geometric constraints. We publicly release the code for the datasets, models, and metrics used throughout the paper at decode.mit.edu/projects/metrics/.
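To make two of the design-specific metric families named above more concrete, the following is a minimal sketch of a constraint satisfaction rate and a nearest-neighbor novelty score on a two-dimensional toy problem. It is an illustration under stated assumptions, not the paper's released implementation: the function names, the unit-circle constraint, and the toy data are all hypothetical, and the paper may define these metrics differently.

```python
import numpy as np


def constraint_satisfaction_rate(samples, is_feasible):
    """Fraction of generated samples accepted by the constraint predicate."""
    mask = np.array([bool(is_feasible(x)) for x in samples])
    return float(mask.mean())


def nearest_neighbor_novelty(samples, dataset):
    """Mean Euclidean distance from each sample to its closest dataset point.

    This is one possible (assumed) notion of novelty: larger values mean the
    generated designs lie farther from the training data.
    """
    # Pairwise differences via broadcasting: shape (n_samples, n_data, dim).
    diffs = samples[:, None, :] - dataset[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return float(dists.min(axis=1).mean())


# Hypothetical toy constraint: a design is feasible inside the unit circle.
def inside_unit_circle(x):
    return float(np.dot(x, x)) <= 1.0


# Hypothetical training data and generated designs in 2-D.
dataset = np.array([[0.0, 0.0], [0.5, 0.0]])
generated = np.array([[0.0, 0.0], [0.5, 0.5], [2.0, 0.0], [0.0, -1.5]])

rate = constraint_satisfaction_rate(generated, inside_unit_circle)
novelty = nearest_neighbor_novelty(generated, dataset)
print(f"constraint satisfaction rate: {rate:.3f}")
print(f"mean nearest-neighbor novelty: {novelty:.3f}")
```

On this toy data the two scores pull in opposite directions: the two infeasible designs are also the ones farthest from the dataset, which is one reason a single metric is rarely sufficient for design problems.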