A Method for Evaluating Deep Generative Models of Images via Assessing the Reproduction of High-order Spatial Context

From arXiv: The paper is under consideration at Pattern Recognition Letters. An early version with preliminary results was accepted for poster presentation at SPIE-MI 2022. This version on arXiv contains new and updated designs of stochastic models, their mathematical representations, and the corresponding results. Data from the designed ensembles are available at https://doi.org/10.7910/DVN/HHF4AF

Deep generative models (DGMs) have the potential to revolutionize diagnostic imaging. Generative adversarial networks (GANs) are one widely employed kind of DGM. The overarching problem with deploying GANs, and other DGMs, in any application that requires domain expertise to use the generated images is that there is generally no adequate or automatic means of assessing their domain-relevant quality. In this work, we demonstrate several objective tests of images output by two popular GAN architectures. We designed several stochastic context models (SCMs) of distinct image features that can be recovered after generation by a trained GAN. Several of these features are high-order, algorithmic pixel-arrangement rules that are not readily expressed in covariance matrices. We designed and validated statistical classifiers to detect specific effects of the known arrangement rules. We then tested the rates at which two different GANs correctly reproduced the feature context under a variety of training scenarios and degrees of feature-class similarity. We found that ensembles of generated images can appear largely accurate visually, and show high accuracy in ensemble measures, while not exhibiting the known spatial arrangements. Furthermore, GANs trained on a spectrum of distinct spatial orders did not respect the given prevalence of those orders in the training data. The main conclusion is that SCMs can be engineered to quantify numerous errors, per image, that may not be captured in ensemble statistics but plausibly can affect subsequent use of the GAN-generated images.
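
The point that a high-order arrangement rule can be invisible to covariance-based measures can be illustrated with a toy example. The sketch below is not one of the paper's SCMs or classifiers. It assumes a hypothetical rule (every non-overlapping 2x2 block of a binary image holds an even number of "on" pixels) and uses an independent-pixel sampler as a stand-in for a generator that has missed the rule. All function names are illustrative. Under this rule the per-pixel mean and all pairwise covariances match those of independent pixels in expectation, so only a per-image check of the rule separates the two ensembles.

```python
import numpy as np


def sample_rule_images(n, size, rng):
    """Toy context model (assumed rule): each 2x2 block has even parity."""
    imgs = rng.integers(0, 2, size=(n, size, size), dtype=np.uint8)
    a = imgs[:, 0::2, 0::2]
    b = imgs[:, 0::2, 1::2]
    c = imgs[:, 1::2, 0::2]
    # Set the fourth pixel of each block to the XOR of the other three,
    # which forces an even number of "on" pixels per block.
    imgs[:, 1::2, 1::2] = a ^ b ^ c
    return imgs


def sample_independent_images(n, size, rng):
    """Stand-in for a generator that ignores the rule: i.i.d. fair pixels."""
    return rng.integers(0, 2, size=(n, size, size), dtype=np.uint8)


def obeys_rule(imgs):
    """Per-image test: True where every 2x2 block has even parity."""
    sums = (imgs[:, 0::2, 0::2].astype(int) + imgs[:, 0::2, 1::2]
            + imgs[:, 1::2, 0::2] + imgs[:, 1::2, 1::2])
    return (sums % 2 == 0).all(axis=(1, 2))


def ensemble_stats(imgs):
    """Ensemble measures: global mean and largest |off-diagonal covariance|."""
    flat = imgs.reshape(len(imgs), -1).astype(float)
    cov = np.cov(flat, rowvar=False)
    off_diag = cov[~np.eye(cov.shape[0], dtype=bool)]
    return flat.mean(), np.abs(off_diag).max()


rng = np.random.default_rng(0)
real = sample_rule_images(2000, 4, rng)          # image size must be even
fake = sample_independent_images(2000, 4, rng)

for name, imgs in (("rule-following", real), ("independent", fake)):
    mean, max_cov = ensemble_stats(imgs)
    rate = obeys_rule(imgs).mean()
    print(f"{name}: mean={mean:.3f} max|cov|={max_cov:.3f} rule rate={rate:.3f}")
```

In this toy setting both ensembles report nearly identical means and near-zero covariances, while the fraction of images that satisfy the rule differs sharply. The same per-image rate, computed per feature class, could in principle be compared against the class prevalence in the training data, which is the kind of check the abstract describes.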
