In this paper, we contend that a natural objective of representation learning is to compress and transform the distribution of the data, for example sets of tokens, towards a low-dimensional Gaussian mixture supported on incoherent subspaces. The quality of such a representation can be evaluated by a principled measure, called sparse rate reduction, that simultaneously rewards the intrinsic information gain and the extrinsic sparsity of the learned representation. From this perspective, popular deep network architectures, including transformers, can be viewed as realizing iterative schemes to optimize this measure. In particular, we derive a transformer block from alternating optimization on parts of this objective: the multi-head self-attention operator compresses the representation by implementing an approximate gradient descent step on the coding rate of the features, and the subsequent multi-layer perceptron sparsifies the features. This leads to a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable. We show, by way of a novel connection between denoising and compression, that the inverse of this compressive encoding can be realized by the same class of CRATE architectures. Thus, the white-box architectures derived in this way can serve as both encoders and decoders. Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets, and achieve performance very close to that of highly engineered transformer-based models: ViT, MAE, DINO, BERT, and GPT2. We believe the proposed computational framework demonstrates great potential in bridging the gap between theory and practice of deep learning, from a unified perspective of data compression. Code is available at: https://ma-lab-berkeley.github.io/CRATE .
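To make the alternating scheme concrete, the following is a minimal NumPy sketch of one "compress, then sparsify" iteration. It is an illustration under stated assumptions, not the CRATE implementation. The abstract gives no formulas, so the sketch assumes the standard lossy coding rate R(Z) = 1/2 logdet(I + d/(n eps^2) Z Z^T), takes an exact gradient step where the abstract describes an approximate one, and uses a non-negative ISTA-style proximal step as one possible sparsifier. The names `coding_rate`, `compress_step`, `sparsify_step`, the bases `U`, the dictionary `D`, and all step sizes are hypothetical.

```python
import numpy as np


def coding_rate(Z, eps=0.5):
    """Assumed coding rate: 1/2 * logdet(I + d/(n*eps^2) * Z Z^T), Z of shape (d, n)."""
    d, n = Z.shape
    alpha = d / (n * eps ** 2)
    _, logdet = np.linalg.slogdet(np.eye(d) + alpha * Z @ Z.T)
    return 0.5 * logdet


def subspace_rate(Z, U, eps=0.5):
    """Sum of coding rates of the features projected onto each subspace basis U_k."""
    return sum(coding_rate(U_k.T @ Z, eps) for U_k in U)


def compress_step(Z, U, step=0.1, eps=0.5):
    """One exact gradient descent step on subspace_rate (the 'compression' half)."""
    n = Z.shape[1]
    grad = np.zeros_like(Z)
    for U_k in U:
        p = U_k.shape[1]
        alpha = p / (n * eps ** 2)
        P = U_k.T @ Z  # projected features, shape (p, n)
        # Gradient of 1/2 logdet(I + alpha P P^T) w.r.t. P is alpha (I + alpha P P^T)^{-1} P.
        G = alpha * np.linalg.solve(np.eye(p) + alpha * P @ P.T, P)
        grad += U_k @ G  # chain rule back to Z
    return Z - step * grad


def sparsify_step(Z, D, step=0.1, lam=0.5):
    """One non-negative ISTA-style step for the sparse code of Z in dictionary D."""
    residual = Z - D @ Z  # reconstruction error when Z itself is the current code
    return np.maximum(Z + step * D.T @ residual - step * lam, 0.0)


rng = np.random.default_rng(0)
d, n, K = 8, 32, 2  # feature dimension, number of tokens, number of subspaces
p = d // K
Z = rng.standard_normal((d, n))

# Hypothetical orthonormal subspace bases, taken from one random orthogonal matrix.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
U = [Q[:, k * p:(k + 1) * p] for k in range(K)]
# Hypothetical orthonormal dictionary for the sparsification half.
D, _ = np.linalg.qr(rng.standard_normal((d, d)))

rate_before = subspace_rate(Z, U)
Z_half = compress_step(Z, U)
rate_after = subspace_rate(Z_half, U)
Z_sparse = sparsify_step(Z_half, D)

print("coding rate before/after compression:", rate_before, rate_after)
print("fraction of zero entries after sparsification:", np.mean(Z_sparse == 0))
```

In this toy setting the gradient step lowers the summed coding rate of the projected features, and the thresholding step yields non-negative features with some exact zeros. This mirrors the division of labor the abstract describes between the attention and MLP blocks.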