U-Nets are among the most widely used architectures in computer vision, renowned for their exceptional performance in applications such as image segmentation, denoising, and diffusion modeling. However, a theoretical explanation of the U-Net architecture design has not yet been fully established. This paper introduces a novel interpretation of the U-Net architecture by studying certain generative hierarchical models, which are tree-structured graphical models extensively utilized in both language and image domains. We demonstrate how U-Nets, with their encoder-decoder structure, long skip connections, and pooling and up-sampling layers, can naturally implement the belief propagation denoising algorithm in such generative hierarchical models, thereby efficiently approximating the denoising functions. This yields an efficient sample complexity bound for learning the denoising function with U-Nets within these models. Additionally, we discuss the broader implications of these findings for diffusion models in generative hierarchical models. We also demonstrate that the conventional architecture of convolutional neural networks (ConvNets) is ideally suited for classification tasks within these models. This offers a unified view of the roles of ConvNets and U-Nets, highlighting the versatility of generative hierarchical models in modeling complex data distributions across language and image domains.
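To make the correspondence concrete, the following is a minimal illustrative sketch (not the paper's exact construction): exact belief propagation denoising on a toy depth-2 binary tree with binary latent variables. The transition matrix `T`, flip probability `EPS`, uniform root prior, and the four-leaf topology are all assumptions chosen for illustration. The upward (leaf-to-root) pass plays the role of the U-Net encoder, the downward (root-to-leaf) pass the decoder, and the reuse of upward messages during the downward pass loosely mirrors the long skip connections.

```python
T = [[0.9, 0.1], [0.1, 0.9]]   # P(child = c | parent = p) = T[p][c] (assumed)
EPS = 0.2                       # leaf observation flip probability (assumed)
PRIOR = [0.5, 0.5]              # uniform root prior

def leaf_lik(y):
    """Likelihood P(observe y | leaf latent x), for x in {0, 1}."""
    return [1 - EPS if y == x else EPS for x in (0, 1)]

def fold(msgs):
    """Upward (encoder) step: combine children's messages into the parent's:
    m(x) = prod_children sum_c T[x][c] * m_child(c)."""
    out = []
    for x in (0, 1):
        p = 1.0
        for m in msgs:
            p *= sum(T[x][c] * m[c] for c in (0, 1))
        out.append(p)
    return out

def down(parent_msg, sibling_up):
    """Downward (decoder) step to one child, reusing the sibling's upward
    message -- the skip-connection analogue:
    d(c) = sum_p parent_msg(p) * (sum_s T[p][s] * sibling_up(s)) * T[p][c]."""
    return [sum(parent_msg[p]
                * sum(T[p][s] * sibling_up[s] for s in (0, 1))
                * T[p][c] for p in (0, 1)) for c in (0, 1)]

def denoise(obs):
    """Posterior P(leaf = 1 | all four noisy observations), exact via BP.
    Tree: root -> (a, b); a -> leaves 0,1; b -> leaves 2,3."""
    ups = [leaf_lik(y) for y in obs]
    up_a, up_b = fold(ups[:2]), fold(ups[2:])   # upward / encoder pass
    d_a = down(PRIOR, up_b)                     # downward / decoder pass
    d_b = down(PRIOR, up_a)
    posts = []
    for i, (d_par, sib) in enumerate([(d_a, ups[1]), (d_a, ups[0]),
                                      (d_b, ups[3]), (d_b, ups[2])]):
        d_i = down(d_par, sib)
        b = [d_i[x] * ups[i][x] for x in (0, 1)]
        posts.append(b[1] / (b[0] + b[1]))
    return posts

print(denoise([1, 1, 0, 1]))
```

Note how the denoised estimate for the flipped leaf is pulled toward its neighbors' consensus: evidence propagates up through shared ancestors and back down, which is exactly the computation the encoder-decoder-with-skips architecture is structured to express.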