In this paper, we contend that a natural objective of representation learning is to compress and transform the distribution of the data, for example sets of tokens, towards a low-dimensional Gaussian mixture supported on incoherent subspaces. The goodness of such a representation can be evaluated by a principled measure, called sparse rate reduction, that simultaneously maximizes the intrinsic information gain and extrinsic sparsity of the learned representation. From this perspective, popular deep network architectures, including transformers, can be viewed as realizing iterative schemes to optimize this measure. In particular, we derive a transformer block from alternating optimization on parts of this objective: the multi-head self-attention operator compresses the representation by implementing an approximate gradient descent step on the coding rate of the features, and the subsequent multi-layer perceptron sparsifies the features. This leads to a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable. We show, by way of a novel connection between denoising and compression, that the inverse to the aforementioned compressive encoding can be realized by the same class of CRATE architectures. Thus, the so-derived white-box architectures are universal to both encoders and decoders. Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets, and achieve performance very close to highly engineered transformer-based models: ViT, MAE, DINO, BERT, and GPT2. We believe the proposed computational framework demonstrates great potential in bridging the gap between theory and practice of deep learning, from a unified perspective of data compression. Code is available at: https://ma-lab-berkeley.github.io/CRATE
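To make the two alternating steps described in the abstract concrete, the following is a minimal NumPy sketch, not the CRATE implementation. It assumes the standard lossy coding rate R(Z) = 1/2 logdet(I + d/(N eps^2) Z Z^T) for features Z of shape (d, N), and it simplifies heavily: the compression step is a plain gradient descent step on that single coding rate (no subspaces, no attention heads), and the sparsification step is one ISTA-style proximal step for a nonnegative LASSO against a dictionary D. The function names, the dictionary D, and the values of eps, step, and lam are all illustrative assumptions.

```python
import numpy as np


def coding_rate(Z, eps=0.5):
    """R(Z) = 1/2 logdet(I + d/(N eps^2) Z Z^T) for Z of shape (d, N)."""
    d, N = Z.shape
    alpha = d / (N * eps**2)
    _, logdet = np.linalg.slogdet(np.eye(d) + alpha * Z @ Z.T)
    return 0.5 * logdet


def compress_step(Z, eps=0.5, step=0.1):
    """One gradient descent step on the coding rate (simplified stand-in
    for the attention-based compression step; an assumption of this sketch)."""
    d, N = Z.shape
    alpha = d / (N * eps**2)
    # Gradient of R with respect to Z is alpha * (I + alpha Z Z^T)^{-1} Z.
    grad = alpha * np.linalg.solve(np.eye(d) + alpha * Z @ Z.T, Z)
    return Z - step * grad


def sparsify_step(Z, D, step=0.1, lam=0.5):
    """One ISTA-style step for min_{A >= 0} 1/2 ||Z - D A||^2 + lam ||A||_1,
    initialized at A = Z (stand-in for the MLP-like sparsification step)."""
    return np.maximum(Z + step * D.T @ (Z - D @ Z) - step * lam, 0.0)


# Hypothetical toy data: d = 8 feature dimensions, N = 64 tokens.
rng = np.random.default_rng(0)
Z0 = rng.standard_normal((8, 64))
D, _ = np.linalg.qr(rng.standard_normal((8, 8)))  # random orthogonal dictionary

Z1 = compress_step(Z0)      # lowers the coding rate
Z2 = sparsify_step(Z1, D)   # nonnegative, with many exact zeros

print("coding rate before/after:", coding_rate(Z0), coding_rate(Z1))
print("fraction of zeros after sparsification:", np.mean(Z2 == 0.0))
```

In this toy setting the gradient step shrinks every singular value of Z, so the coding rate decreases, and the thresholded step returns a nonnegative array in which a sizeable share of entries are exactly zero. A layer in the spirit of the abstract would alternate these two operations, with learned parameters replacing the fixed D and step sizes used here.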