Sparser Training for On-Device Recommendation Systems

Recommender systems often rely on large embedding tables that map users and items to dense vectors of uniform size, leading to substantial memory consumption and inefficiencies. This is particularly problematic in memory-constrained environments such as mobile and Web of Things (WoT) applications, where scalability and real-time performance are critical. Various research efforts have sought to address these issues. Embedding pruning methods based on Dynamic Sparse Training (DST) stand out for their low training and inference costs, consistent sparsity, and end-to-end differentiability, but they face three key challenges. First, they typically initialize the mask matrix, which is used to prune redundant parameters, with a random uniform sparse initialization. This strategy often results in suboptimal performance because it creates unstructured and inefficient connections. Second, when they reactivate pruned parameters with large gradient magnitudes, they tend to favor the users and items sampled in the single batch immediately before weight exploration, which does not necessarily improve overall performance. Third, while they use sparse weights during forward passes, they still need to compute dense gradients during backward passes. In this paper, we propose SparseRec, a lightweight embedding method based on DST, to address these issues. Specifically, SparseRec initializes the mask matrix using Nonnegative Matrix Factorization. It accumulates gradients to identify the inactive parameters that would better improve model performance once activated. Furthermore, it avoids dense gradients during backpropagation by sampling a subset of important vectors. Gradients are calculated only for parameters in this subset, so training remains sparse in both the forward and backward passes.
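To make the weight exploration step more concrete, the following is a minimal sketch of a generic DST prune-and-regrow update that ranks inactive parameters by gradients accumulated over several batches rather than by the last batch alone. It is an illustration under stated assumptions, not SparseRec's actual procedure: the function name `prune_and_regrow`, the `update_frac` parameter, magnitude-based pruning, and zero-initialization of regrown entries are all hypothetical choices, and the accumulated gradients are kept dense here for simplicity (the sampling-based sparse backward pass is not modeled).

```python
import numpy as np

def prune_and_regrow(weights, mask, grad_accum, update_frac=0.1):
    """One mask update at constant sparsity (illustrative sketch only).

    weights    : dense array holding the embedding parameters
    mask       : boolean array of the same shape, True = active
    grad_accum : gradients accumulated over several batches (assumed dense here)
    update_frac: hypothetical fraction of active entries to swap per update
    """
    mask = mask.copy()
    active = np.flatnonzero(mask)
    inactive = np.flatnonzero(~mask)
    k = min(int(update_frac * active.size), inactive.size)
    if k == 0:
        return np.where(mask, weights, 0.0), mask

    # Prune: drop the k active entries with the smallest weight magnitude.
    flat_w = np.abs(weights).reshape(-1)
    drop = active[np.argsort(flat_w[active])[:k]]

    # Regrow: activate the k inactive entries with the largest
    # accumulated-gradient magnitude (candidates fixed before pruning).
    flat_g = np.abs(grad_accum).reshape(-1)
    grow = inactive[np.argsort(flat_g[inactive])[-k:]]

    flat_mask = mask.reshape(-1)  # view into the copied mask
    flat_mask[drop] = False
    flat_mask[grow] = True

    new_weights = np.where(mask, weights, 0.0)
    new_weights.reshape(-1)[grow] = 0.0  # assumption: regrown entries start at zero
    return new_weights, mask


# Usage: accumulate stand-in gradients over several batches, then update the mask.
rng = np.random.default_rng(0)
weights = rng.normal(size=(6, 4))
mask = rng.random((6, 4)) < 0.5
grad_accum = np.zeros_like(weights)
for _ in range(5):
    grad_accum += rng.normal(size=weights.shape)  # placeholder for per-batch gradients
new_weights, new_mask = prune_and_regrow(weights, mask, grad_accum, update_frac=0.25)
```

Because the same number of entries is pruned and regrown, the number of active parameters is unchanged by the update, which reflects the consistent-sparsity property attributed to DST methods above.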
