U-Nets are among the most widely used architectures in computer vision, known for their strong performance in applications such as image segmentation, denoising, and diffusion modeling. However, a theoretical explanation of the U-Net architecture design has not yet been fully established. This paper introduces a novel interpretation of the U-Net architecture by studying certain generative hierarchical models, which are tree-structured graphical models extensively used in both language and image domains. We demonstrate how U-Nets, with their encoder-decoder structure, long skip connections, and pooling and up-sampling layers, can naturally implement the belief propagation denoising algorithm in such generative hierarchical models, thereby efficiently approximating the denoising functions. This leads to an efficient sample complexity bound for learning the denoising function using U-Nets within these models. Additionally, we discuss the broader implications of these findings for diffusion models in generative hierarchical models. We also demonstrate that the conventional architecture of convolutional neural networks (ConvNets) is ideally suited for classification tasks within these models. This offers a unified view of the roles of ConvNets and U-Nets, highlighting the versatility of generative hierarchical models in modeling complex data distributions across language and image domains.
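To make the correspondence between belief propagation and the U-Net layout more concrete, the following is a minimal sketch of exact belief propagation denoising on a toy hierarchical model. It is not the paper's construction: the complete binary tree, the single transition matrix `P` shared by all edges, the Gaussian leaf noise, and all names (`bp_denoise`, `pi`, `mu`, `sigma`) are assumptions made for illustration only. The upward pass merges children pairwise (loosely analogous to pooling), the downward pass copies each parent's belief to its two children (loosely analogous to up-sampling), and the stored upward messages reused on the way down play the role of long skip connections.

```python
import numpy as np

def bp_denoise(y, P, pi, mu, sigma):
    """Posterior mean of the clean leaves given noisy leaves y (toy model).

    Assumed model (hypothetical, for illustration): a complete binary tree
    with discrete node states; root ~ pi; child state t given parent state s
    has probability P[s, t]; a leaf in state t emits mu[t] plus Gaussian
    noise with standard deviation sigma.
    """
    n, S = y.shape[0], mu.shape[0]
    depth = n.bit_length() - 1
    assert 2 ** depth == n, "number of leaves must be a power of two"

    # Leaf evidence: Gaussian likelihood of y under each state, normalized.
    log_lik = -0.5 * ((y[:, None] - mu[None, :]) / sigma) ** 2
    log_lik -= log_lik.max(axis=1, keepdims=True)  # avoid underflow
    leaf_ev = np.exp(log_lik)
    leaf_ev /= leaf_ev.sum(axis=1, keepdims=True)

    # Upward pass ("encoder"): send messages to parents, merge sibling pairs.
    up, msgs = leaf_ev, []
    for _ in range(depth):
        m = up @ P.T                 # m[i, s] = sum_t P[s, t] * up[i, t]
        msgs.append(m)               # kept for the downward pass ("skip")
        up = m[0::2] * m[1::2]       # combine children 2k and 2k+1
        up /= up.sum(axis=1, keepdims=True)

    # Downward pass ("decoder"): start at the root prior, go back to leaves.
    down = pi[None, :]
    for m in reversed(msgs):
        # Each child sees its parent's belief times its sibling's message.
        sibling = m.reshape(-1, 2, S)[:, ::-1, :].reshape(-1, S)
        context = np.repeat(down, 2, axis=0) * sibling
        down = context @ P           # down[i, t] = sum_s context[i, s] P[s, t]
        down /= down.sum(axis=1, keepdims=True)

    # Leaf posterior combines top-down context with local evidence.
    post = down * leaf_ev
    post /= post.sum(axis=1, keepdims=True)
    return post @ mu

# Example usage with arbitrary illustrative parameters.
rng = np.random.default_rng(0)
P_demo = np.array([[0.9, 0.1], [0.2, 0.8]])
pi_demo = np.array([0.5, 0.5])
mu_demo = np.array([-1.0, 1.0])
y_demo = rng.normal(size=8)
print(bp_denoise(y_demo, P_demo, pi_demo, mu_demo, sigma=1.0))
```

On a tree, these two passes compute the exact posterior marginals, so the returned vector is the Bayes-optimal (posterior mean) denoiser for this toy model. A U-Net would replace the fixed message updates with learned layers; the sketch only shows the data flow that such a network would need to reproduce.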