Deep neural networks can learn powerful prior probability models for images, as evidenced by the high-quality generations obtained with recent score-based diffusion methods. But the means by which these networks capture complex global statistical structure, apparently without suffering from the curse of dimensionality, remain a mystery. To study this, we incorporate diffusion methods into a multi-scale decomposition, reducing dimensionality by assuming a stationary local Markov model for wavelet coefficients conditioned on coarser-scale coefficients. We instantiate this model using convolutional neural networks (CNNs) with local receptive fields, which enforce both the stationarity and Markov properties. Global structures are captured using a CNN with receptive fields covering the entire (but small) low-pass image. We test this model on a dataset of face images, which are highly non-stationary and contain large-scale geometric structures. Remarkably, denoising, super-resolution, and image synthesis results all demonstrate that these structures can be captured with significantly smaller conditioning neighborhoods than required by a Markov model implemented in the pixel domain. Our results show that score estimation for large complex images can be reduced to low-dimensional Markov conditional models across scales, alleviating the curse of dimensionality.
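As an illustration of the multi-scale decomposition described above, the following minimal sketch splits an image into a small low-pass image plus wavelet detail coefficients at successive scales, then inverts the decomposition. It assumes an orthonormal Haar wavelet, which the abstract does not specify, and the function names (`haar_step`, `haar_inverse_step`, `decompose`, `reconstruct`) are hypothetical. The conditional score networks themselves are not modeled here; the sketch only shows the coefficient layout that a coarse-to-fine conditional model would operate on.

```python
import numpy as np


def haar_step(x):
    """One level of an orthonormal 2-D Haar transform (assumed wavelet).

    Returns the low-pass image and three detail bands, each with half
    the height and width of x. Both dimensions of x must be even.
    """
    a = x[0::2, 0::2]
    b = x[0::2, 1::2]
    c = x[1::2, 0::2]
    d = x[1::2, 1::2]
    low = (a + b + c + d) / 2.0
    horiz = (a - b + c - d) / 2.0
    vert = (a + b - c - d) / 2.0
    diag = (a - b - c + d) / 2.0
    return low, (horiz, vert, diag)


def haar_inverse_step(low, details):
    """Invert haar_step: rebuild the finer image from low-pass and details."""
    horiz, vert, diag = details
    out = np.empty((2 * low.shape[0], 2 * low.shape[1]), dtype=low.dtype)
    out[0::2, 0::2] = (low + horiz + vert + diag) / 2.0
    out[0::2, 1::2] = (low - horiz + vert - diag) / 2.0
    out[1::2, 0::2] = (low + horiz - vert - diag) / 2.0
    out[1::2, 1::2] = (low - horiz - vert + diag) / 2.0
    return out


def decompose(x, levels):
    """Return the coarsest low-pass image and detail bands, coarse to fine."""
    details = []
    low = np.asarray(x, dtype=np.float64)
    for _ in range(levels):
        low, bands = haar_step(low)
        details.append(bands)
    return low, details[::-1]


def reconstruct(low, details):
    """Rebuild the image scale by scale, starting from the coarsest level.

    In a coarse-to-fine generative model, each set of detail bands would be
    sampled conditioned on the current low-pass image; here the stored
    coefficients are simply reused.
    """
    for bands in details:
        low = haar_inverse_step(low, bands)
    return low


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.standard_normal((64, 64))
    low, details = decompose(image, levels=3)
    print("low-pass shape:", low.shape)
    print("detail shapes (coarse to fine):", [b[0].shape for b in details])
    print("perfect reconstruction:", np.allclose(reconstruct(low, details), image))
```

With a 64×64 input and three levels, the low-pass image is only 8×8, which is why a network whose receptive field covers the whole low-pass image can remain small, while the detail bands at each finer scale are handled by local conditional models.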