Transformers achieve superior performance on many tasks but impose heavy compute and memory requirements during inference. Inference can be made more efficient by partitioning it across multiple devices, which in turn requires compressing the intermediate representations. We introduce a principled rate-distortion-based framework for lossy compression that learns compact encodings that explicitly trade bitrate for accuracy. Experiments on language benchmarks show that the simplest of the proposed codecs achieves substantial rate savings, outperforming more complex methods. We characterize and analyze the rate-distortion behaviour of transformers, offering a unified lens for understanding performance in representation coding. This formulation extends information-theoretic concepts to derive bounds on the achievable rate of learnable codecs. Across architectures and tasks, we empirically demonstrate that the rates of these codecs are driven by these bounds, which adds to the explainability of the formulation.
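To make the bitrate-versus-accuracy trade-off concrete, the following is a minimal sketch rather than the proposed codec. It assumes a fixed uniform scalar quantizer in place of a learned encoder, empirical entropy as a proxy for bitrate, and mean squared error as a stand-in for the task-accuracy distortion. All function names, the synthetic Gaussian "activations", and the Lagrangian weight `lam` are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter


def quantize(values, step):
    """Uniform scalar quantization: map each value to an integer bin index."""
    return [round(v / step) for v in values]


def dequantize(indices, step):
    """Reconstruct approximate values from bin indices."""
    return [i * step for i in indices]


def empirical_rate_bits(indices):
    """Empirical entropy of the indices in bits per element (bitrate proxy)."""
    counts = Counter(indices)
    n = len(indices)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mse(a, b):
    """Mean squared error, used here as a stand-in distortion measure."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def rd_point(values, step):
    """Return (rate, distortion) for one quantization step size."""
    idx = quantize(values, step)
    return empirical_rate_bits(idx), mse(values, dequantize(idx, step))


def rd_cost(values, step, lam):
    """Lagrangian cost R + lam * D (assumed form of a rate-distortion objective)."""
    rate, dist = rd_point(values, step)
    return rate + lam * dist


# Synthetic stand-in for an intermediate representation (not real model activations).
rng = random.Random(0)
activations = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

for step in (0.1, 0.5, 2.0):
    rate, dist = rd_point(activations, step)
    print(f"step={step:>4}: rate={rate:.2f} bits/element, mse={dist:.4f}")
```

Coarser steps lower the estimated rate and raise the distortion, which is the trade-off a learned codec would tune. In this toy setting, sweeping `step` traces out an operational rate-distortion curve, and `rd_cost` picks a point on it for a given `lam`.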