White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?

Comments (from arXiv): This paper integrates the works arXiv:2306.01129 and arXiv:2308.16271, as well as the under-review work https://openreview.net/forum?id=PvyOYleymy, into a complete story. In this paper, we improve the writing and organization, and also add conceptual, empirical, and theoretical improvements over the previous work.

In this paper, we contend that a natural objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a low-dimensional Gaussian mixture supported on incoherent subspaces. The goodness of such a representation can be evaluated by a principled measure, called sparse rate reduction, that simultaneously maximizes the intrinsic information gain and extrinsic sparsity of the learned representation. From this perspective, popular deep network architectures, including transformers, can be viewed as realizing iterative schemes to optimize this measure. Particularly, we derive a transformer block from alternating optimization on parts of this objective: the multi-head self-attention operator compresses the representation by implementing an approximate gradient descent step on the coding rate of the features, and the subsequent multi-layer perceptron sparsifies the features. This leads to a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable. We show, by way of a novel connection between denoising and compression, that the inverse to the aforementioned compressive encoding can be realized by the same class of CRATE architectures. Thus, the so-derived white-box architectures are universal to both encoders and decoders. Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets, and achieve performance very close to highly engineered transformer-based models: ViT, MAE, DINO, BERT, and GPT2. We believe the proposed computational framework demonstrates great potential in bridging the gap between theory and practice of deep learning, from a unified perspective of data compression. Code is available at: https://ma-lab-berkeley.github.io/CRATE .
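The two alternating steps described in the abstract (compress by descending the coding rate, then sparsify) can be illustrated with a small NumPy sketch. This is only a toy illustration under stated assumptions, not the CRATE implementation: the abstract does not give formulas, so the coding rate here is assumed to be the standard lossy-coding estimate R(Z) = 1/2 logdet(I + d/(N eps^2) Z Z^T) from the rate reduction literature, the compression step is a plain gradient step on that whole quantity (no attention heads or learned subspaces), and the sparsification step is a generic non-negative soft threshold standing in for the MLP-like operator. The function names `coding_rate`, `compress_step`, and `sparsify_step`, and the values of `eps`, `step`, and `threshold`, are hypothetical.

```python
import numpy as np


def coding_rate(Z, eps=0.5):
    """Assumed coding rate: 1/2 * logdet(I + d/(N eps^2) Z Z^T), Z of shape (d, N)."""
    d, N = Z.shape
    alpha = d / (N * eps**2)
    # slogdet is numerically safer than log(det(...)); the matrix is positive definite.
    _, logdet = np.linalg.slogdet(np.eye(d) + alpha * Z @ Z.T)
    return 0.5 * logdet


def compress_step(Z, eps=0.5, step=0.1):
    """One gradient-descent step on the assumed coding rate (toy stand-in for attention)."""
    d, N = Z.shape
    alpha = d / (N * eps**2)
    # Gradient of the coding rate: alpha * (I + alpha Z Z^T)^{-1} Z.
    grad = alpha * np.linalg.solve(np.eye(d) + alpha * Z @ Z.T, Z)
    return Z - step * grad


def sparsify_step(Z, threshold=0.5):
    """Non-negative soft threshold (toy stand-in for the sparsifying MLP-like step)."""
    return np.maximum(Z - threshold, 0.0)


rng = np.random.default_rng(0)
Z = rng.standard_normal((8, 32))   # d = 8 feature dimensions, N = 32 tokens

Z_half = compress_step(Z)          # compression: lowers the coding rate
Z_next = sparsify_step(Z_half)     # sparsification: zeroes out small entries

rate_before = coding_rate(Z)
rate_after = coding_rate(Z_half)
zeros_before = int(np.sum(Z_half == 0))
zeros_after = int(np.sum(Z_next == 0))

print(f"coding rate: {rate_before:.4f} -> {rate_after:.4f}")
print(f"exact zeros: {zeros_before} -> {zeros_after}")
```

With a small step size, the gradient step shrinks every singular value of `Z` slightly, so the coding rate goes down; the threshold then sets small and negative entries exactly to zero. Stacking such pairs of steps is the rough intuition behind reading a transformer block as an unrolled optimization iteration.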
