The massive interest in deep neural networks (DNNs) for both computer vision and natural language processing has been sparked by the growth in computational power. However, this growth has also increased the memory footprint of these models, to the point where simply loading one on a commodity device such as a mobile phone can be challenging. Quantization is a favored solution to this limitation, as it maps high-precision tensors to a low-precision, memory-efficient format. In terms of memory footprint reduction, its most effective variants are based on codebooks. These methods, however, suffer from two limitations. First, they either define a single codebook for each tensor, or use a memory-expensive mapping to multiple codebooks. Second, gradient descent optimization of the mapping favors jumps toward extreme values, and therefore does not define a proximal search. In this work, we address both limitations. First, we group similarly distributed neurons at initialization and leverage the re-ordered structure to either apply different scale factors to the different groups, or map the weights that fall in these groups to several codebooks, without any mapping overhead. Second, starting from this initialization, we propose a joint learning of the codebooks and weight mappings that bears similarities with recent gradient-based post-training quantization techniques. Third, drawing inspiration from straight-through estimation techniques, we introduce a novel gradient update definition that enables a proximal search of the codebooks and their mappings. The proposed jointly learnable codebooks and mappings (JLCM) method allows a very efficient approximation of any DNN: for instance, Llama 7B can be compressed down to 2 GB and loaded on 5-year-old smartphones.
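To make the grouping idea concrete, here is a minimal sketch of codebook quantization with one codebook per group of neurons. It is not the JLCM method itself: the codebooks are fitted with plain k-means rather than learned jointly with the mappings, and "similarly distributed" is approximated by sorting neurons by the standard deviation of their weights. Both choices are assumptions made for illustration, and the names `kmeans_1d`, `quantize_grouped`, `dequantize` and `mse` are hypothetical.

```python
import random
from statistics import pstdev


def kmeans_1d(values, k, iters=20):
    """Lloyd's k-means on scalars; returns (codebook, nearest-entry index per value)."""
    ordered = sorted(values)
    # Initialise the k centroids at evenly spaced order statistics.
    codebook = [ordered[round(j * (len(ordered) - 1) / (k - 1))] for j in range(k)]

    def nearest(v):
        return min(range(k), key=lambda j: abs(v - codebook[j]))

    for _ in range(iters):
        sums, counts = [0.0] * k, [0] * k
        for v in values:
            j = nearest(v)
            sums[j] += v
            counts[j] += 1
        # Empty clusters keep their previous centroid.
        codebook = [sums[j] / counts[j] if counts[j] else codebook[j] for j in range(k)]
    return codebook, [nearest(v) for v in values]


def quantize_grouped(weight, n_groups=2, bits=2):
    """weight: list of rows, one row per neuron. Fits one codebook per neuron group."""
    k = 2 ** bits
    # Assumption: two neurons are "similarly distributed" when their weight
    # standard deviations are close, so sorting by std makes groups contiguous.
    order = sorted(range(len(weight)), key=lambda r: pstdev(weight[r]))
    size = -(-len(order) // n_groups)  # ceil division: rows per group
    codebooks, indices = [], []
    for start in range(0, len(order), size):
        rows = [weight[r] for r in order[start:start + size]]
        flat = [v for row in rows for v in row]
        codebook, idx = kmeans_1d(flat, k)
        codebooks.append(codebook)
        width = len(rows[0])
        indices += [idx[i * width:(i + 1) * width] for i in range(len(rows))]
    # Group membership is implied by row position (row // size), so no
    # per-weight group id has to be stored alongside the indices.
    return order, size, codebooks, indices


def dequantize(order, size, codebooks, indices):
    restored = [None] * len(order)
    for pos, row in enumerate(indices):
        codebook = codebooks[pos // size]
        restored[order[pos]] = [codebook[i] for i in row]  # undo the re-ordering
    return restored


def mse(a, b):
    n = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


# Toy layer: 8 neurons x 64 inputs, half with small weights, half with large ones.
rng = random.Random(0)
W = [[rng.gauss(0.0, 0.05 if r % 2 else 1.0) for _ in range(64)] for r in range(8)]
err_single = mse(W, dequantize(*quantize_grouped(W, n_groups=1)))
err_grouped = mse(W, dequantize(*quantize_grouped(W, n_groups=2)))
print(f"MSE with 1 codebook: {err_single:.5f}, with 2 codebooks: {err_grouped:.5f}")
```

Because the rows are re-ordered once, the only extra storage is the neuron permutation and one small codebook per group; each weight still costs `bits` bits. For scale, 2 GB for roughly 7 billion weights corresponds to about 2.3 bits per weight on average, assuming 1 GB = 10^9 bytes.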