In the literature, various reversible deep neural network (DNN) models have been proposed to reduce memory consumption or improve data throughput in the training process. However, almost all existing reversible DNNs are either constrained to special structures or constructed by considerably modifying the original DNN architectures to enable reversibility. In this work, we propose exact bit-level reversible transformers that leave the architecture unchanged in the inference procedure. The basic idea is to first treat each transformer block as the Euler integration approximation for solving an ordinary differential equation (ODE), and then to incorporate the technique of bidirectional integration approximation (BDIA) (see [26] for BDIA-based diffusion inversion) into the neural architecture, together with activation quantization, to make it exactly bit-level reversible; we refer to the result as BDIA-transformer. In the training process, we let a hyper-parameter $\gamma$ in BDIA-transformer randomly take one of the two values $\{0.5, -0.5\}$ per transformer block to average two consecutive integration approximations, which regularizes the model and improves validation accuracy. Lightweight side information must be stored per transformer block in the forward pass to account for the binary quantization loss and enable exact bit-level reversibility. In the inference procedure, the expectation $\mathbb{E}(\gamma)=0$ is taken, making the architecture of BDIA-transformer identical to that of the original transformer up to activation quantization. Our empirical study indicates that BDIA-transformers notably outperform their original counterparts due to the regularization effect of the $\gamma$ parameter.
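The reversibility mechanism can be sketched in fixed-point (integer) arithmetic. This is a minimal illustration, not the paper's implementation: `block` is a hypothetical stand-in for a quantized transformer block, the update $x_{k+1} = \gamma x_{k-1} + (1-\gamma) x_k + f_k(x_k)$ follows the abstract's description with $\gamma \in \{0.5, -0.5\}$, and the low bit lost when halving plays the role of the lightweight side information; the update is linear in $x_{k-1}$, so it can be inverted exactly.

```python
def block(x, k):
    # Stand-in for the k-th transformer block acting on quantized (integer)
    # activations; any deterministic int -> int map works for this demo.
    return [(v * (k + 3)) % 97 - 48 for v in x]

def step_coeffs(g):
    # Integer coefficients (2*gamma, 2*(1 - gamma)) for gamma in {0.5, -0.5},
    # so the update can be written as floor((a*x_prev + b*x_cur) / 2) + f.
    return (1, 1) if g == 0.5 else (-1, 3)

def bdia_forward(x_prev, x_cur, gammas):
    """Apply x_{k+1} = gamma*x_{k-1} + (1-gamma)*x_k + f_k(x_k) in fixed point,
    keeping one residual bit per coordinate per block as side information."""
    side = []
    for k, g in enumerate(gammas):
        a, b = step_coeffs(g)
        t = [a * p + b * c for p, c in zip(x_prev, x_cur)]
        side.append([v % 2 for v in t])            # bits lost by the halving
        f = block(x_cur, k)
        x_prev, x_cur = x_cur, [v // 2 + fv for v, fv in zip(t, f)]
    return x_prev, x_cur, side

def bdia_backward(x_prev, x_cur, gammas, side):
    """Recover all earlier activations exactly from the last two states."""
    for k in reversed(range(len(gammas))):
        a, b = step_coeffs(gammas[k])
        f = block(x_prev, k)                       # x_prev holds x_k here
        t = [2 * (cv - fv) + r for cv, fv, r in zip(x_cur, f, side[k])]
        # a is +/-1, so dividing by a equals multiplying by a.
        x_prev, x_cur = [a * (tv - b * pv) for tv, pv in zip(t, x_prev)], x_prev
    return x_prev, x_cur
```

Running the forward pass over several blocks with random $\gamma \in \{0.5, -0.5\}$ and then the backward pass reconstructs the input activations bit-exactly from only the last two states and the stored bits, which is what removes the need to cache intermediate activations during training.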