Network Memory Footprint Compression Through Jointly Learnable Codebooks and Mappings

The massive interest in deep neural networks (DNNs) for both computer vision and natural language processing has been sparked by the growth in computational power. However, this growth has also increased the memory footprint of models, to the point where simply loading a model on commodity devices such as mobile phones can be challenging. To address this limitation, quantization is a favored solution, as it maps high-precision tensors to a low-precision, memory-efficient format. In terms of memory footprint reduction, its most effective variants are based on codebooks. These methods, however, suffer from two limitations. First, they either define a single codebook for each tensor, or use a memory-expensive mapping to multiple codebooks. Second, gradient descent optimization of the mapping favors jumps toward extreme values, and hence does not define a proximal search. In this work, we address both limitations. First, we initially group similarly distributed neurons and leverage the re-ordered structure to either apply different scale factors to the different groups, or map weights that fall in these groups to several codebooks, without any mapping overhead. Second, stemming from this initialization, we propose a joint learning of the codebook and weight mappings that bears similarities with recent gradient-based post-training quantization techniques. Third, drawing inspiration from straight-through estimation techniques, we introduce a novel gradient update definition to enable a proximal search of the codebooks and their mappings. The proposed jointly learnable codebooks and mappings (JLCM) method allows a very efficient approximation of any DNN: for example, a Llama 7B can be compressed down to 2 GB and loaded on 5-year-old smartphones.
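The grouping idea in the abstract can be illustrated with a minimal sketch. This is not the JLCM method itself: the abstract does not say how neurons are grouped or how codebooks are initialized, so the sketch assumes neurons are sorted by the standard deviation of their weights and that each group's codebook is fitted with plain 1-D k-means. The joint learning step and the proposed gradient update are not shown. All function names (`fit_codebook`, `quantize_grouped`, `dequantize`) are hypothetical.

```python
import random
import statistics


def fit_codebook(values, k, iters=20):
    """Fit k centroids to scalar values with 1-D k-means (Lloyd's algorithm)."""
    lo, hi = min(values), max(values)
    # Start with centroids spread uniformly over the value range.
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            buckets[nearest].append(v)
        # Keep the previous centroid when a bucket is empty.
        centroids = [sum(b) / len(b) if b else c
                     for b, c in zip(buckets, centroids)]
    return centroids


def quantize_grouped(weights, n_groups, k):
    """weights: list of rows, one row per neuron. Returns order, codebooks, indices."""
    # Assumption: "similarly distributed" is approximated by similar spread.
    order = sorted(range(len(weights)),
                   key=lambda r: statistics.pstdev(weights[r]))
    size = -(-len(order) // n_groups)  # ceiling division
    codebooks, indices = [], []
    for g in range(n_groups):
        rows = order[g * size:(g + 1) * size]
        cb = fit_codebook([w for r in rows for w in weights[r]], k)
        codebooks.append(cb)
        for r in rows:
            indices.append([min(range(k), key=lambda j: abs(w - cb[j]))
                            for w in weights[r]])
    return order, codebooks, indices


def dequantize(order, codebooks, indices, n_groups):
    """Rebuild an approximation of the original rows in their original order."""
    size = -(-len(order) // n_groups)
    out = [None] * len(order)
    for pos, r in enumerate(order):
        # The group is implied by the position in the re-ordered layout,
        # so no per-weight "which codebook" index needs to be stored.
        cb = codebooks[pos // size]
        out[r] = [cb[i] for i in indices[pos]]
    return out


# Toy layer: 8 neurons with 16 weights each, alternating wide and narrow spread.
random.seed(0)
weights = [[random.gauss(0, 0.1 if r % 2 else 1.0) for _ in range(16)]
           for r in range(8)]
order, codebooks, indices = quantize_grouped(weights, n_groups=2, k=4)
approx = dequantize(order, codebooks, indices, n_groups=2)
```

With `k=4`, each weight is stored as a 2-bit index, plus a small per-group codebook and one permutation of the neurons. For scale, and assuming roughly 7 × 10^9 parameters and decimal gigabytes, the 2 GB figure quoted above corresponds to about 2.3 bits per weight on average.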

