Learning from Invalid Data: On Constraint Satisfaction in Generative Models

Generative models have demonstrated impressive results in vision, language, and speech. However, even with massive datasets, they struggle with precision and generate physically invalid or factually incorrect data. This is particularly problematic when the generated data must satisfy constraints, for example to meet product specifications in engineering design or to adhere to the laws of physics in a natural scene. To improve precision while preserving diversity and fidelity, we propose a novel training mechanism that leverages datasets of constraint-violating data points, which we consider invalid. Our approach minimizes the divergence between the generative distribution and the valid prior while maximizing the divergence between the generative distribution and the invalid distribution. We demonstrate that generative models such as GANs and DDPMs, when augmented to train with invalid data, vastly outperform their standard counterparts, which train solely on valid data points. For example, our training procedure generates up to 98% fewer invalid samples on 2D densities, improves connectivity and stability four-fold on a block-stacking problem, and improves constraint satisfaction by 15% on a structural topology optimization benchmark in engineering design. We also analyze how the quality of the invalid data affects the learning procedure and the generalization properties of the models. Finally, we demonstrate significant improvements in sample efficiency: a tenfold increase in valid samples leads to a negligible difference in constraint satisfaction, whereas adding fewer than 10% invalid samples leads to a tenfold improvement. Our proposed mechanism offers a promising way to improve precision in generative models while preserving diversity and fidelity, particularly in domains where constraint satisfaction is critical and data is limited, such as engineering design, robotics, and medicine.
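To make the stated objective concrete, the following is a minimal sketch on a toy discrete support. It is not the paper's implementation: the abstract does not specify the divergence or how the two terms are weighted, so the use of KL divergence, the weight `lam`, the function names, and the four-bin toy distributions are all assumptions made for illustration.

```python
import numpy as np


def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions on the same bins."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # normalize defensively
    q = q / q.sum()
    return float(np.sum(p * (np.log(p) - np.log(q))))


def invalid_aware_objective(p_gen, p_valid, p_invalid, lam=0.1):
    """Hypothetical loss: stay close to valid data, move away from invalid data.

    `lam` (an assumed hyperparameter) weights the repulsion term; it is kept
    small here so that the attraction to the valid distribution dominates.
    """
    return kl(p_gen, p_valid) - lam * kl(p_gen, p_invalid)


# Toy support with four bins: bins 0-1 are mostly valid, bins 2-3 mostly invalid.
p_valid = np.array([0.45, 0.45, 0.05, 0.05])
p_invalid = np.array([0.05, 0.05, 0.45, 0.45])

# Two candidate generators: one concentrated on the valid bins,
# one that leaks probability mass into the invalid bins.
gen_clean = np.array([0.48, 0.48, 0.02, 0.02])
gen_leaky = np.array([0.35, 0.35, 0.15, 0.15])

loss_clean = invalid_aware_objective(gen_clean, p_valid, p_invalid)
loss_leaky = invalid_aware_objective(gen_leaky, p_valid, p_invalid)
print(f"clean generator loss: {loss_clean:.4f}")
print(f"leaky generator loss: {loss_leaky:.4f}")
```

Under these assumptions, the generator that places less mass on the invalid bins receives the lower loss. In practice the repulsion term is unbounded, so its weight would need to be chosen with care; the value used here is arbitrary.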
