Transformers achieve superior performance on many tasks but impose heavy compute and memory requirements during inference. Inference can be made more efficient by partitioning it across multiple devices, which in turn requires compressing the intermediate representations exchanged between them. In this work, we introduce a principled rate-distortion framework for lossy compression that learns compact encodings which explicitly trade off bitrate against accuracy. Experiments on language benchmarks show that the proposed codec achieves substantial bitrate savings, in some cases with improved accuracy, outperforming more complex baseline methods. We characterize and analyze the rate-distortion performance of transformers, offering a unified lens for understanding performance in representation coding. The formulation extends information-theoretic concepts to define the gap between rate and entropy and derives bounds on this gap. We further develop probably approximately correct (PAC)-style bounds for estimating it. Across different architectures and tasks, we empirically demonstrate that the achieved rates are driven by these bounds, adding to the explainability of the formulation.
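To make the bitrate-accuracy trade-off concrete, the following is a minimal illustrative sketch, not the paper's codec: a uniform scalar quantizer applied to a stand-in intermediate representation, sweeping the quantization step to trace a rate-distortion curve and selecting the operating point minimizing the Lagrangian cost D + λR. The activation data, step sizes, and λ value are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)  # hypothetical stand-in for a transformer activation

def rate_distortion(x, step):
    """Quantize x with a uniform step; return (rate in bits/symbol, MSE distortion)."""
    q = np.round(x / step)       # quantization indices
    x_hat = q * step             # dequantized reconstruction
    # Empirical entropy of the indices serves as a simple bitrate proxy.
    _, counts = np.unique(q, return_counts=True)
    p = counts / counts.sum()
    rate = -np.sum(p * np.log2(p))
    distortion = np.mean((x - x_hat) ** 2)
    return rate, distortion

lam = 0.05  # Lagrange multiplier trading distortion against bitrate (illustrative)
steps = [0.1, 0.2, 0.5, 1.0, 2.0]
costs = [d + lam * r for r, d in (rate_distortion(x, s) for s in steps)]
best = steps[int(np.argmin(costs))]
print("selected step:", best)
```

Coarser steps lower the rate but raise the distortion; varying λ sweeps out the rate-distortion frontier that a learned codec would optimize end to end.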