Transformers achieve strong performance on many tasks but impose heavy compute and memory requirements during inference. Inference can be made more efficient by partitioning it across multiple devices, which in turn requires compressing the intermediate representations. We introduce a principled rate-distortion framework for lossy compression that learns compact encodings which explicitly trade bitrate against accuracy. Experiments on language benchmarks show that the simplest of the proposed codecs achieves substantial rate savings, outperforming more complex methods. We characterize and analyze the rate-distortion behaviour of transformers, offering a unified lens for understanding performance in representation coding. This formulation extends information-theoretic concepts to define the gap between rate and entropy and to derive bounds on that gap. We further develop probably approximately correct (PAC)-style bounds for estimating this gap. Across different architectures and tasks, we empirically demonstrate that the achieved rates are driven by these bounds, which adds to the explainability of the formulation.
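To make the rate-distortion trade-off mentioned above concrete, the following is a minimal sketch and not the codec proposed in the paper. It assumes a fixed uniform scalar quantizer in place of a learned encoding, uses the empirical entropy of the quantized symbols as a stand-in for the rate of an ideal entropy coder, and uses mean squared error as a stand-in for task accuracy. The synthetic Gaussian `features`, the step sizes, and all function names are hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter


def quantize(values, step):
    """Uniform scalar quantization: map each value to an integer bin index."""
    return [round(v / step) for v in values]


def empirical_entropy_bits(symbols):
    """Empirical entropy in bits per symbol (proxy for an ideal coder's rate)."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mse(values, symbols, step):
    """Mean squared error between the values and their dequantized versions."""
    return sum((v - s * step) ** 2 for v, s in zip(values, symbols)) / len(values)


def rd_point(values, step):
    """Return (rate, distortion) for one quantizer step size."""
    symbols = quantize(values, step)
    return empirical_entropy_bits(symbols), mse(values, symbols, step)


def rd_cost(values, step, lam):
    """Lagrangian cost R + lam * D, with lam (assumed) weighting distortion."""
    rate, dist = rd_point(values, step)
    return rate + lam * dist


# Synthetic stand-in for an intermediate representation (assumption).
rng = random.Random(0)
features = [rng.gauss(0.0, 1.0) for _ in range(10000)]

# Sweep the step size: coarser steps lower the rate and raise the distortion.
points = {step: rd_point(features, step) for step in (0.25, 0.5, 1.0, 2.0)}
for step, (rate, dist) in points.items():
    print(f"step={step}: rate={rate:.3f} bits/symbol, mse={dist:.5f}")
```

Sweeping the step size traces out an operating curve: a larger step yields fewer distinct symbols (lower rate) at the cost of a larger reconstruction error. A learned codec of the kind described in the abstract would instead optimize the encoding itself against such a trade-off.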