White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?

arXiv comments: This paper integrates the works arXiv:2306.01129 and arXiv:2308.16271 into a complete story. In this paper, we improve the writing and organization, and also add conceptual, empirical, and theoretical improvements over the previous work. V2: small typo fixes and formatting improvements.

In this paper, we contend that a natural objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a low-dimensional Gaussian mixture supported on incoherent subspaces. The goodness of such a representation can be evaluated by a principled measure, called sparse rate reduction, that simultaneously maximizes the intrinsic information gain and extrinsic sparsity of the learned representation. From this perspective, popular deep network architectures, including transformers, can be viewed as realizing iterative schemes to optimize this measure. Particularly, we derive a transformer block from alternating optimization on parts of this objective: the multi-head self-attention operator compresses the representation by implementing an approximate gradient descent step on the coding rate of the features, and the subsequent multi-layer perceptron sparsifies the features. This leads to a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable. We show, by way of a novel connection between denoising and compression, that the inverse to the aforementioned compressive encoding can be realized by the same class of CRATE architectures. Thus, the so-derived white-box architectures are universal to both encoders and decoders. Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets, and achieve performance very close to highly engineered transformer-based models: ViT, MAE, DINO, BERT, and GPT2. We believe the proposed computational framework demonstrates great potential in bridging the gap between theory and practice of deep learning, from a unified perspective of data compression. Code is available at: https://ma-lab-berkeley.github.io/CRATE .
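
The two ingredients named in the abstract, a coding rate that measures how compressed a set of token features is and a sparsification step, can be illustrated with a small NumPy sketch. The abstract itself gives no formulas, so everything below is an assumption: the coding rate is written in the form commonly used in the rate-reduction literature, R(Z) = 1/2 logdet(I + d/(n eps^2) Z Z^T), and the sparsification is shown as one non-negative ISTA-style proximal gradient step against a dictionary D. The names `coding_rate` and `ista_step`, and the parameters `eps`, `step`, and `lam`, are hypothetical and are not taken from the CRATE code.

```python
import numpy as np


def coding_rate(Z: np.ndarray, eps: float = 0.5) -> float:
    """Assumed coding rate: 0.5 * logdet(I + d / (n * eps^2) * Z @ Z.T).

    Z has shape (d, n): n tokens stored as d-dimensional columns.
    """
    d, n = Z.shape
    gram = np.eye(d) + (d / (n * eps**2)) * (Z @ Z.T)
    _, logdet = np.linalg.slogdet(gram)  # slogdet returns (sign, log|det|)
    return 0.5 * float(logdet)


def ista_step(Z: np.ndarray, D: np.ndarray,
              step: float = 0.1, lam: float = 0.1) -> np.ndarray:
    """One non-negative ISTA-style step (an illustrative assumption).

    Starts from the code A = Z, takes a gradient step on
    0.5 * ||Z - D @ A||^2, then applies a ReLU shifted by step * lam,
    which zeroes small or negative entries.
    """
    residual = Z - D @ Z
    return np.maximum(Z + step * (D.T @ residual) - step * lam, 0.0)


rng = np.random.default_rng(0)
d, n = 8, 64

# Tokens spread over all d dimensions vs. tokens confined to a 2-D subspace.
Z_full = rng.standard_normal((d, n))
U, _ = np.linalg.qr(rng.standard_normal((d, 2)))  # orthonormal basis, shape (d, 2)
Z_low = U @ rng.standard_normal((2, n))

print("coding rate, full-dimensional tokens:", coding_rate(Z_full))
print("coding rate, 2-D subspace tokens:   ", coding_rate(Z_low))

# Sparsify the full-dimensional tokens against a random orthogonal dictionary.
D, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = ista_step(Z_full, D)
print("fraction of exact zeros after one step:", float(np.mean(A == 0.0)))
```

In this toy setting, tokens confined to a low-dimensional subspace have a lower coding rate than tokens spread over the whole space, which is the sense in which lowering the coding rate of the features "compresses" them. The thresholded step returns a non-negative array with many exact zeros, which is the sense in which it "sparsifies" them.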

