The top-k operator returns a k-sparse vector, where the non-zero values correspond to the k largest values of the input. Unfortunately, because it is a discontinuous function, it is difficult to incorporate into neural networks trained end-to-end with backpropagation. Recent works have considered differentiable relaxations, based either on regularization or perturbation techniques. However, to date, no approach is both fully differentiable and sparse. In this paper, we propose new differentiable and sparse top-k operators. We view the top-k operator as a linear program over the permutahedron, the convex hull of permutations. We then introduce a p-norm regularization term to smooth out the operator, and show that its computation can be reduced to isotonic optimization. Our framework is significantly more general than the existing one and makes it possible, for example, to express top-k operators that select values by magnitude. On the algorithmic side, in addition to pool adjacent violators (PAV) algorithms, we propose a new GPU/TPU-friendly Dykstra algorithm to solve isotonic optimization problems. We successfully use our operators to prune weights in neural networks, to fine-tune vision transformers, and as a router in sparse mixtures of experts.
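To make the objects named in the abstract concrete, the following is a minimal sketch of the hard (non-relaxed) top-k operator in plain Python. It is an illustration under stated assumptions, not the method proposed in the paper: the function names (`topk_mask`, `topk_mask_lp`, `topk`, `topk_magnitude`) are hypothetical, and the linear-program view is checked by brute-force enumeration of the vertices of the permutahedron generated by (1, ..., 1, 0, ..., 0), which is only feasible for very small inputs. The regularized, differentiable operators and the PAV and Dykstra solvers are not reproduced here.

```python
from itertools import combinations


def topk_mask(x, k):
    """Hard top-k mask: 1.0 at the k largest entries of x, 0.0 elsewhere."""
    # Indices sorted by decreasing value; ties are broken by original position.
    order = sorted(range(len(x)), key=lambda i: x[i], reverse=True)
    keep = set(order[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(x))]


def topk_mask_lp(x, k):
    """Same mask, found as the maximizer of <y, x> over permutahedron vertices.

    The vertices are the permutations of (1, ..., 1, 0, ..., 0) with k ones.
    A linear objective over a polytope attains its maximum at a vertex, so
    enumerating the vertices solves the linear program (brute force, for
    illustration only).
    """
    best_y, best_val = None, float("-inf")
    for support in combinations(range(len(x)), k):
        y = [1.0 if i in support else 0.0 for i in range(len(x))]
        val = sum(yi * xi for yi, xi in zip(y, x))
        if val > best_val:
            best_y, best_val = y, val
    return best_y


def topk(x, k):
    """k-sparse vector keeping the k largest values of x."""
    return [xi if mi else 0.0 for xi, mi in zip(x, topk_mask(x, k))]


def topk_magnitude(x, k):
    """k-sparse vector keeping the k entries of x largest in absolute value."""
    mask = topk_mask([abs(xi) for xi in x], k)
    return [xi if mi else 0.0 for xi, mi in zip(x, mask)]


x = [0.5, -2.0, 1.5, 0.1]
print(topk(x, 2))            # [0.5, 0.0, 1.5, 0.0]
print(topk_magnitude(x, 2))  # [0.0, -2.0, 1.5, 0.0]
print(topk_mask_lp(x, 2))    # [1.0, 0.0, 1.0, 0.0]

# Discontinuity: an arbitrarily small change in the input flips the mask.
print(topk_mask([1.0, 1.001], 1))  # [0.0, 1.0]
print(topk_mask([1.001, 1.0], 1))  # [1.0, 0.0]
```

The last two calls show why the hard operator is awkward for backpropagation: the mask jumps between vertices as the input crosses a tie, and it is piecewise constant everywhere else, so its gradient carries no useful signal.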